[
https://issues.apache.org/jira/browse/SAMZA-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15437657#comment-15437657
]
Jon Bringhurst commented on SAMZA-750:
--------------------------------------
We've spent some time testing this in a few development clusters (Samza >= 0.10
and YARN 2.7.1) and it appears to work correctly. Here are a few notes
from [~spvenkat]'s testing:
{quote}
h3. Setup
* One job with a container count of 2
* A second job with host affinity enabled and a container count of 32
In all cases, the RMs were killed abruptly with SIGTERM (no controlled
shutdown) to check whether a graceful RM shutdown is required for YARN's
work-preserving recovery feature to function properly. The status of the jobs
was monitored through the RM UI. A sketch of the job configuration used is
shown below.
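For reference, a minimal sketch of the host-affinity job's YARN-related
configs, assuming Samza 0.10-era property names (yarn.container.count,
yarn.samza.host-affinity.enabled); the values mirror the setup above but are
otherwise illustrative:
{code}
# Sketch of the second test job's configs (Samza 0.10-era property names;
# illustrative values, not the exact job definition used in these tests).
yarn.container.count=32
yarn.samza.host-affinity.enabled=true
{code}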
h3. When the work-preserving RM config is turned on:
After a successful failover to the standby resource manager, an epoch number
is added to the container IDs of newly submitted jobs. Container IDs of
already-running jobs do not change after the failover, and before the failover
to the standby resource manager, the container IDs of jobs and tasks contain
no epoch number at all. Parsing container IDs that carry an epoch number has
only been supported since YARN 2.6.0/2.6.1. When SAMZA-750 was filed (Samza
0.9 was the latest release at the time), parsing container IDs with epoch
information failed in the application master due to the YARN 2.4 dependency in
Samza.
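To illustrate the version dependency, here is a minimal parsing sketch
assuming a 2.6+ YARN client on the classpath (the ID strings are made-up
examples, not IDs captured from these test runs):
{code:java}
import org.apache.hadoop.yarn.api.records.ContainerId;

public class EpochIdParseSketch {
    public static void main(String[] args) {
        // Pre-failover format: no epoch component.
        ContainerId plain =
            ContainerId.fromString("container_1410901177871_0001_01_000005");

        // Post-failover format: "e17" is the RM epoch added by
        // work-preserving recovery. A YARN 2.4 client cannot parse this
        // string, which is what broke the Samza application master.
        ContainerId withEpoch =
            ContainerId.fromString("container_e17_1410901177871_0001_01_000005");

        System.out.println(plain);     // container_1410901177871_0001_01_000005
        System.out.println(withEpoch); // container_e17_1410901177871_0001_01_000005
    }
}
{code}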
The following combinations were used to test the YARN work-preserving RM
feature:
* Samza 0.10 with YARN 2.6.1
* Samza 0.11 with YARN 2.6.1
* Samza 0.11 with YARN 2.7.1
In all of those runs, all NMs correctly failed over to the passive RM when the
active RM was killed. Our tests show that the work-preserving RM feature is
supported for jobs running on Samza 0.10 or higher.
Failover happens almost immediately after the active RM is killed. However, if
a new job is submitted before the RM failover is complete, it only starts
running once the interval set by
yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms has elapsed;
the relevant yarn-site.xml properties are sketched below.
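As a reference point, a minimal yarn-site.xml sketch of the work-preserving
recovery knobs (property names are from the YARN docs; the values here are
illustrative assumptions, not the exact settings used on the test clusters):
{code:xml}
<!-- Hypothetical yarn-site.xml excerpt; values are illustrative. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- How long the recovered RM waits before scheduling new containers,
       giving the NMs time to re-register and report running containers. -->
  <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
  <value>10000</value>
</property>
{code}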
h3. Scenarios tested:
Basic recovery: Kill the active RM and expect the passive RM to take over.
When this happens, task containers and NMs do not restart, and all statistics
related to the job are preserved as well.
Container ID format: A YARN container ID has the form
container_ClusterTimestamp_ApplicationId_AttemptId_ContainerId. After a
successful switch to a standby RM with work preserving enabled, newly
submitted and restarted jobs end up with an epoch in their container ID
string. To parse these container IDs, a YARN Java client at version 2.6.1 or
later is required.
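For illustration, the anatomy of a post-failover container ID (a made-up
example matching the format above):
{noformat}
container_e17_1410901177871_0001_01_000005
          |   |             |    |  |
          |   |             |    |  +-- container sequence number
          |   |             |    +----- application attempt ID
          |   |             +---------- application ID
          |   +------------------------ RM cluster timestamp
          +---------------------------- epoch (absent before the first failover)
{noformat}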
Simultaneous restart: Restart both the active and standby RMs at the same
time. Job and task containers are restarted in this case even when work
preserving is enabled.
Container kill during failover: Kill the active resource manager, then kill an
active task container before the failover is complete. After failover, the new
resource manager detects that a container has been terminated and requests a
new container. For stateful jobs (host affinity enabled), a task container
killed before failover completes is re-requested from the same node manager on
which it was previously running.
Full restart: Kill everything (RM, NM, etc.). The jobs are restarted once all
the processes come back up.
{quote}
> Run YARN RM Recovery test to uncover any potential issues with SamzaAppMaster
> -----------------------------------------------------------------------------
>
> Key: SAMZA-750
> URL: https://issues.apache.org/jira/browse/SAMZA-750
> Project: Samza
> Issue Type: Test
> Components: yarn
> Affects Versions: 0.10.0
> Reporter: Yi Pan (Data Infrastructure)
> Assignee: Shanthoosh Venkataraman
>
> Currently, there are no tests covering YARN RM Recovery support in Samza.
> As pointed out by [~llamahunter], there is likely an issue with containerID
> versioning in SamzaAppMaster when handling the RM Recovery case. There might
> be more issues.
> This JIRA is to track the effort to uncover issues related to the YARN RM
> Recovery feature. The expected outcome is:
> 1) Some test suite that runs Samza jobs in YARN with RM Recovery
> 2) A list of issues discovered. Each issue may result in a separate JIRA to
> be fixed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)