[
https://issues.apache.org/jira/browse/MAPREDUCE-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159319#comment-15159319
]
Junping Du commented on MAPREDUCE-6608:
---------------------------------------
Thanks [~srikanth.sampath] for updating the design doc and uploading an
outstanding demo patch!
Sorry for the late reply, I just came back from vacation. I finally got a
chance to review the latest document and the demo patch.
+1 on Vinod's proposal of separating the write and read paths. This solution is
even better than my proposal (the HDFS way) above, since having no single point
of access means better scalability. The only problem is that the implementation
is more complicated, as it involves a new RPC service in the NM (with the task
as the client) and more payload in the NM-RM heartbeat, so we should split it
out into a dedicated YARN JIRA to track the work.
Other quick comments on the design doc:
bq. The work preserving feature of an MR Application can be set at an
application level, when the application is submitted.
Sounds good. We can introduce a new MR config to switch this feature on/off
(off by default). However, I didn't see any implementation of this in the demo
patch, and I think we should add it from the beginning, as we want to keep the
old behavior (code path) unchanged when the feature is off.
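A rough sketch of such a switch, just to illustrate the idea. The property name
below is made up for this example (it is not an existing MR config), and plain
java.util.Properties stands in for the job configuration:

```java
import java.util.Properties;

public class WorkPreservingFlag {
    // Hypothetical property name, used only for illustration.
    static final String KEY = "mapreduce.am.work-preserving-restart.enabled";
    // Off by default, so existing jobs keep the old code path.
    static final boolean DEFAULT = false;

    /** Returns true only when the job explicitly opts in. */
    static boolean isEnabled(Properties jobConf) {
        return Boolean.parseBoolean(
            jobConf.getProperty(KEY, String.valueOf(DEFAULT)));
    }

    /** Picks the code path; the returned labels are placeholders. */
    static String chooseUmbilical(Properties jobConf) {
        return isEnabled(jobConf) ? "umbilical-with-retries" : "vanilla-umbilical";
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(chooseUmbilical(conf)); // feature not set: old path
        conf.setProperty(KEY, "true");
        System.out.println(chooseUmbilical(conf)); // feature switched on
    }
}
```

With the flag unset, everything goes through the unchanged path, which is the
property we want to guarantee from the first patch.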
bq. When the AM starts up, the registry operations is started as a service. An
AM creates a service record id being the JobId and persistence being at the
application level. It then stores the address(host, port) as an internal
endpoint.
Besides needing to replace the read path of the registry service, another point
is that we don't necessarily need to keep the first attempt's AM info, which
could save most of the overhead we are adding here, since most applications are
expected to finish with a single attempt. Isn't that right?
bq. Currently, YarnChild uses positional arguments as parameters. This will be
enhanced to use named arguments as parameters. For work preserving jobs, the
path to the service record is passed as the parameter to determine the address
of the AM.
Agreed that named arguments sound better. However, the positional approach has
worked for a long time in the MapReduce project, and we would prefer not to
change it unless we find some issue or bug. For the path to the service record,
we need to stay consistent with our decision on the read path.
bq. Thus UmbilicalWithRetries is a wrapper over Umbilical with retries
implemented. Depending on whether the AM is workpreserving or not, a factory
method creates either a vanilla umbilical or one with retries.
UmbilicalWithRetries should follow the existing practice (for RPC client retry
during service downtime), which is to create a RetryProxy with a
FailoverProxyProvider (maybe call it MRAMProxy) so that a task attempt can
contact the new AM attempt instance.
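To make the pattern concrete, here is a minimal, self-contained sketch of a
retry-with-failover proxy. It does not use the Hadoop retry classes:
java.lang.reflect.Proxy stands in for them, and the Umbilical interface, the
resolver, and all names below are hypothetical stand-ins rather than code from
the patch:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.Supplier;

public class AmFailoverSketch {
    /** Hypothetical one-method stand-in for the task-to-AM protocol. */
    public interface Umbilical {
        String ping() throws IOException;
    }

    /**
     * Wraps the umbilical in a proxy. On IOException the proxy asks the
     * resolver for the current AM (e.g. a new attempt) and retries, giving
     * up after maxFailovers failovers. No backoff is modeled here.
     */
    static Umbilical retrying(Supplier<Umbilical> resolver, int maxFailovers) {
        InvocationHandler handler = new InvocationHandler() {
            private Umbilical current = resolver.get();

            @Override
            public synchronized Object invoke(Object proxy, Method method,
                                              Object[] args) throws Throwable {
                for (int failovers = 0; ; failovers++) {
                    try {
                        return method.invoke(current, args);
                    } catch (InvocationTargetException e) {
                        Throwable cause = e.getCause();
                        boolean retriable = cause instanceof IOException;
                        if (!retriable || failovers >= maxFailovers) {
                            throw cause;
                        }
                        current = resolver.get(); // fail over to the new AM
                    }
                }
            }
        };
        return (Umbilical) Proxy.newProxyInstance(
            Umbilical.class.getClassLoader(),
            new Class<?>[] {Umbilical.class}, handler);
    }

    /** Simulates AM attempt 1 being down and attempt 2 answering. */
    static String demo() {
        int[] attempt = {0};
        Supplier<Umbilical> resolver = () -> {
            int id = ++attempt[0];
            return () -> {
                if (id == 1) {
                    throw new IOException("AM attempt 1 is gone");
                }
                return "pong from AM attempt " + id;
            };
        };
        try {
            return retrying(resolver, 3).ping();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "pong from AM attempt 2"
    }
}
```

The task code only sees the Umbilical interface, so the retry and failover
logic stays out of the task itself. In the real change, the resolver's role
would be played by whatever read path we settle on for locating the new AM.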
The TaskManagement part looks good to me.
> Work Preserving AM Restart for MapReduce
> ----------------------------------------
>
> Key: MAPREDUCE-6608
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6608
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Srikanth Sampath
> Assignee: Srikanth Sampath
> Attachments: Patch1.patch, WorkPreservingMRAppMaster-1.pdf,
> WorkPreservingMRAppMaster-2.pdf, WorkPreservingMRAppMaster.pdf
>
>
> Providing a framework for work preserving AM is achieved in
> [YARN-1489|https://issues.apache.org/jira/browse/YARN-1489]. We would like
> to take advantage of this for MapReduce (MR) applications. There are some
> challenges, which are described in the attached document along with a few
> options. We solicit feedback from the community.