[
https://issues.apache.org/jira/browse/SAMZA-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15437657#comment-15437657
]
Jon Bringhurst commented on SAMZA-750:
--------------------------------------
We've spent some time testing this in a few development clusters (Samza >= 0.10
and YARN 2.7.1) and it appears to work correctly. Here are a few notes
from [~spvenkat]'s testing:
{quote}
h3. Setup
* One job with a container count of 2
* A second job with host affinity enabled and a container count of 32
In all cases, the RMs were killed abruptly with SIGTERM (no controlled
shutdown) to check whether a graceful RM shutdown is required for YARN's
work-preserving recovery feature to function properly. The status of the jobs
was monitored through the RM UI. A sketch of the job configuration used is
shown below.
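For reference, a minimal sketch of the host-affinity job's YARN-related
configs, assuming Samza 0.10-era property names (yarn.container.count,
yarn.samza.host-affinity.enabled); the values mirror the setup above but are
otherwise illustrative:
{code}
# Sketch of the second test job's configs (Samza 0.10-era property names;
# illustrative values, not the exact job definition used in these tests).
yarn.container.count=32
yarn.samza.host-affinity.enabled=true
{code}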
h3. When the work-preserving RM config is turned on:
After a successful failover to the standby resource manager, an epoch number
is added to the container IDs of newly submitted jobs. Container IDs of
already-running jobs do not change after the failover, and before the failover
to the standby resource manager, the container IDs of jobs and tasks contain
no epoch number at all. Parsing container IDs that carry an epoch number has
only been supported since YARN 2.6.0/2.6.1. When SAMZA-750 was filed (Samza
0.9 was the latest release at the time), parsing container IDs with epoch
information failed in the application master due to the YARN 2.4 dependency in
Samza.
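To illustrate the version dependency, here is a minimal parsing sketch
assuming a 2.6+ YARN client on the classpath (the ID strings are made-up
examples, not IDs captured from these test runs):
{code:java}
import org.apache.hadoop.yarn.api.records.ContainerId;

public class EpochIdParseSketch {
    public static void main(String[] args) {
        // Pre-failover format: no epoch component.
        ContainerId plain =
            ContainerId.fromString("container_1410901177871_0001_01_000005");

        // Post-failover format: "e17" is the RM epoch added by
        // work-preserving recovery. A YARN 2.4 client cannot parse this
        // string, which is what broke the Samza application master.
        ContainerId withEpoch =
            ContainerId.fromString("container_e17_1410901177871_0001_01_000005");

        System.out.println(plain);     // container_1410901177871_0001_01_000005
        System.out.println(withEpoch); // container_e17_1410901177871_0001_01_000005
    }
}
{code}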
The following combinations were used to test the YARN work-preserving RM
feature:
* Samza 0.10 with YARN 2.6.1
* Samza 0.11 with YARN 2.6.1
* Samza 0.11 with YARN 2.7.1
In all of those runs, all NMs correctly failed over to the passive RM when the
active RM was killed. Our tests show that the work-preserving RM feature is
supported for jobs running on Samza 0.10 or higher.
Failover happens almost immediately after the active RM is killed. However, if
a new job is submitted before the RM failover is complete, it only starts
running once the interval set by
yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms has elapsed;
the relevant yarn-site.xml properties are sketched below.
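As a reference point, a minimal yarn-site.xml sketch of the work-preserving
recovery knobs (property names are from the YARN docs; the values here are
illustrative assumptions, not the exact settings used on the test clusters):
{code:xml}
<!-- Hypothetical yarn-site.xml excerpt; values are illustrative. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- How long the recovered RM waits before scheduling new containers,
       giving the NMs time to re-register and report running containers. -->
  <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
  <value>10000</value>
</property>
{code}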
h3. Scenarios tested:
Basic recovery: Kill the active RM and expect the passive RM to take over.
When this happens, task containers and NMs do not restart, and all statistics
related to the job are preserved as well.
Container ID format: A YARN container ID has the form
container_ClusterTimestamp_ApplicationId_AttemptId_ContainerId. After a
successful switch to a standby RM with work preserving enabled, newly
submitted and restarted jobs end up with an epoch in their container ID
string. To parse these container IDs, a YARN Java client at version 2.6.1 or
later is required.
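For illustration, the anatomy of a post-failover container ID (a made-up
example matching the format above):
{noformat}
container_e17_1410901177871_0001_01_000005
          |   |             |    |  |
          |   |             |    |  +-- container sequence number
          |   |             |    +----- application attempt ID
          |   |             +---------- application ID
          |   +------------------------ RM cluster timestamp
          +---------------------------- epoch (absent before the first failover)
{noformat}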
Simultaneous restart: Restart both the active and standby RMs at the same
time. Job and task containers are restarted in this case even when work
preserving is enabled.
Container kill during failover: Kill the active resource manager, then kill an
active task container before the failover is complete. After failover, the new
resource manager detects that a container has been terminated and requests a
new container. For stateful jobs (host affinity enabled), a task container
killed before failover completes is re-requested from the same node manager on
which it was previously running.
Full restart: Kill everything (RM, NM, etc.). The jobs are restarted once all
the processes come back up.
{quote}
> Run YARN RM Recovery test to uncover any potential issues with SamzaAppMaster
> -----------------------------------------------------------------------------
>
> Key: SAMZA-750
> URL: https://issues.apache.org/jira/browse/SAMZA-750
> Project: Samza
> Issue Type: Test
> Components: yarn
> Affects Versions: 0.10.0
> Reporter: Yi Pan (Data Infrastructure)
> Assignee: Shanthoosh Venkataraman
>
> Currently, there are no tests covering YARN RM Recovery support in Samza.
> As pointed out by [~llamahunter], there is likely an issue with containerID
> versioning in SamzaAppMaster when handling the RM Recovery case. There might
> be more issues.
> This JIRA is to track the effort to uncover issues related to the YARN RM
> Recovery feature. The expected outcome is:
> 1) Some test suite that runs Samza jobs in YARN with RM Recovery
> 2) A list of issues discovered. Each issue may result in a separate JIRA to
> be fixed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)