[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478005#comment-13478005
 ] 

Robert Joseph Evans commented on MAPREDUCE-4495:
------------------------------------------------

bq. On *How will the WFAM handle itself crashing…*: ...
That is great, but I am more curious about restarting the child AMs. Will they 
be restarted just as the RM restarts AMs today? I assume so; I just wanted to 
be sure.

bq. On *If the WFAM does restart after a crash will it try to reestablish 
communication with App Masters*, the WFAM would be able to reconnect with 
children AMs without any issue, as they would have continued working without 
knowing that the parent WFAM got restarted; it would just use their async 
client APIs.
The concept is great. I think MR originally had that concept for reestablishing 
communication with its tasks too, but has not delivered on it, because it 
requires the tasks to have a way to update the URL that they heartbeat into. 
Not super hard, but this would be the first AM to do it.
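The "update the URL that they heartbeat into" part could be sketched roughly as below. This is a minimal, purely illustrative sketch, assuming the task has some resolver (e.g. a query to the RM) to look up its AM's current address; none of these names are real YARN/MR APIs.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical sketch: a task-side heartbeat loop that re-resolves the AM
 * address on failure instead of pinning the URL it was launched with, so a
 * restarted AM coming up on a new host can be found again.
 */
public class ReresolvingHeartbeater {
    /** Where the task asks "what is my AM's current address?" (illustrative). */
    private final Supplier<String> amAddressResolver;
    private String cachedAddress;

    public ReresolvingHeartbeater(Supplier<String> amAddressResolver) {
        this.amAddressResolver = amAddressResolver;
        this.cachedAddress = amAddressResolver.get();
    }

    /** One heartbeat attempt; on failure, re-resolve the address and retry once. */
    public boolean heartbeat(Predicate<String> send) {
        if (send.test(cachedAddress)) {
            return true;
        }
        // The AM may have restarted elsewhere: ask the resolver again.
        cachedAddress = amAddressResolver.get();
        return send.test(cachedAddress);
    }

    public String currentAddress() {
        return cachedAddress;
    }
}
```

The key design point is that the resolver, not the launch-time configuration, is the source of truth for the AM address, which is exactly the plumbing the MRAM never delivered.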

bq. On *How will the WFAM schedule containers*... reuse containers when it 
makes sense. This would be possible when the WFAM is using embedded AMs (ie an 
MRAM to run an MR job) and the embedded AMs support injection of Container 
implementations (ie to replace the default container allocation)
My point is that just replacing the default container allocator is not 
sufficient. You also have to provide a way to launch the container once it is 
allocated. The container launch context bakes in several things, the 
distributed cache in particular, that may be nearly impossible to change 
without giving YARN the ability to relaunch a container with a different 
context.
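To make that concern concrete, here is a hedged sketch of the compatibility check a WFAM would need before handing an already-running container to a different job. The `LaunchSpec` type is an illustrative stand-in, not YARN's actual ContainerLaunchContext.

```java
import java.util.Map;
import java.util.Objects;

/** Illustrative stand-in for a container launch context; not a real YARN type. */
public class LaunchSpec {
    final Map<String, String> distCache; // resource name -> expected checksum
    final String command;

    public LaunchSpec(Map<String, String> distCache, String command) {
        this.distCache = distCache;
        this.command = command;
    }

    /**
     * A container launched with {@code running} can only be reused for
     * {@code wanted} if everything baked in at launch time still matches;
     * otherwise YARN would have to relaunch it with a different context.
     */
    public static boolean reusable(LaunchSpec running, LaunchSpec wanted) {
        return Objects.equals(running.distCache, wanted.distCache)
            && Objects.equals(running.command, wanted.command);
    }
}
```

With launch contexts this constrained, reuse only works between jobs that happen to need identical dist-cache contents and commands, which is the gap the comment is pointing at.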

bq. On *How do you decide which AM etc has a higher priority…*, we are 
constrained by a DAG, thus current DAG nodes get to run before upcoming ones.
I get that you are constrained by the DAG, but even within your DAG you can 
have two AMs running in parallel. When a container request comes back, which 
AM do you give that container to? Will it be FIFO, where the first AM to 
launch gets all containers until its outstanding requests have been satisfied? 
Or will it be more of a shortest-distance calculation? For example: the 
mem/cpu resources match for both AM1 and AM2, the container is on 
foo1.rack1.com, AM1 wants something on foo2.rack1.com, and AM2 wants something 
on foo1.rack2.com, so I give the container to AM1 because it is closer.
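The shortest-distance tie-break in that example could look something like the sketch below. The host-name parsing (hosts shaped like "foo1.rack1.com") and the FIFO tie-break are assumptions made purely for illustration; real YARN locality handling works off configured rack topology, not host names.

```java
/** Hedged sketch of the locality tie-break described above; not real YARN code. */
public class LocalityChooser {
    static String rackOf(String host) {
        // Assumption for this sketch: hosts look like "foo1.rack1.com",
        // so the rack is the second dot-separated label.
        String[] parts = host.split("\\.");
        return parts.length > 1 ? parts[1] : host;
    }

    /** 0 = same node, 1 = same rack, 2 = off-rack. */
    static int distance(String allocated, String wanted) {
        if (allocated.equals(wanted)) return 0;
        if (rackOf(allocated).equals(rackOf(wanted))) return 1;
        return 2;
    }

    /** Give the container to whichever AM's request is closer; ties go FIFO (AM1). */
    static String choose(String container, String am1Wants, String am2Wants) {
        return distance(container, am1Wants) <= distance(container, am2Wants)
                ? "AM1" : "AM2";
    }
}
```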

bq. On *How is security going to be handled?*, no different from how it is 
handled in the MRAM.
The MRAM currently does not do anything to allow clients to securely connect 
through RPC to arbitrary containers running as part of that job. Currently the 
MR client establishes a connection to the RM to get a token and a single URL 
to use to communicate with the AM. Will the WFAM act like an RM and hand out 
tokens/URLs to clients so they can then connect to the child AMs? Or will it 
just pass the secret it was given to the child AMs, so the original RM token 
works for all of them? In that case, how will the client get the URL to 
communicate with a child AM? Or will RPC communication with the child AMs just 
not be allowed?
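The first option above (the WFAM acting RM-like and handing out per-child connection info) could be sketched as a simple registry. Everything here, including the idea that a child AM registers its URL and token back with the parent, is hypothetical; it only illustrates the shape of the lookup a client would need.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a WFAM handing out child-AM connection info. */
public class ChildAmRegistry {
    /** What a client would need to talk to a child AM: an RPC URL plus a token. */
    public static class Endpoint {
        public final String url;
        public final String token;
        public Endpoint(String url, String token) {
            this.url = url;
            this.token = token;
        }
    }

    private final Map<String, Endpoint> children = new HashMap<>();

    /** Called by the WFAM when a child AM registers back after launch. */
    public void register(String childId, String url, String token) {
        children.put(childId, new Endpoint(url, token));
    }

    /** Called on behalf of a client that already authenticated to the WFAM. */
    public Endpoint lookup(String childId) {
        return children.get(childId);
    }
}
```

The alternative (reusing the original RM secret across all child AMs) would drop the per-child token field but still leaves the URL-discovery half of the lookup unanswered.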

bq. For Pig / Hive collaboration I think we should be focusing on lower level 
issues, like sorting out the details of hierarchical scheduling and sorting out 
how we can really reduce the launch latency of simple jobs to near zero. Every 
framework will benefit from that!

I am totally +1 on reducing job launch latency. I know that Rob Parker and 
Jason Lowe have been doing some profiling of that and looking for low-hanging 
fruit, but there is always a need for more eyes. I would add that we probably 
also want to make sure that Pig and Hive can take advantage of the Uber AM, 
even to the point of dynamically changing the AM's resource request/heap size 
in the client, so that more jobs can be run as Uber without fear of getting 
OOMs.
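The client-side decision hinted at here (grow the AM request so more jobs fit uber mode without OOMing) could look roughly like the following. The method name and the idea of a simple additive headroom calculation are invented for illustration; they are not how the MR client actually decides uber eligibility.

```java
/** Hedged sketch of a client-side uber-mode sizing decision; all thresholds invented. */
public class UberPlanner {
    /**
     * If the whole job's estimated footprint fits inside a (possibly enlarged)
     * AM container, run it uber-style in the AM; otherwise fall back to normal
     * mode. Returns the AM memory (MB) to request, or -1 for non-uber.
     */
    static int amMemoryForUber(int estimatedJobMemMb, int defaultAmMemMb, int maxAmMemMb) {
        // Leave the AM its own memory on top of the job's estimate so the
        // AM itself does not OOM while running the tasks in-process.
        int needed = defaultAmMemMb + estimatedJobMemMb;
        return needed <= maxAmMemMb ? needed : -1;
    }
}
```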

                
> Workflow Application Master in YARN
> -----------------------------------
>
>                 Key: MAPREDUCE-4495
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4495
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.0.0-alpha
>            Reporter: Bo Wang
>            Assignee: Bo Wang
>         Attachments: MAPREDUCE-4495-v1.1.patch, MAPREDUCE-4495-v1.patch, 
> MapReduceWorkflowAM.pdf, yapp_proposal.txt
>
>
> It is useful to have a workflow application master, which will be capable of 
> running a DAG of jobs. The workflow client submits a DAG request to the AM 
> and then the AM will manage the life cycle of this application in terms of 
> requesting the needed resources from the RM, and starting, monitoring and 
> retrying the application's individual tasks.
> Compared to running Oozie with the current MapReduce Application Master, 
> these are some of the advantages:
>  - Fewer consumed resources, since only one application master will be 
> spawned for the whole workflow.
>  - Reuse of resources, since the same resources can be used by multiple 
> consecutive jobs in the workflow (no need to request/wait for resources for 
> every individual job from the central RM).
>  - More optimization opportunities in terms of collective resource requests.
>  - Optimization opportunities in terms of rewriting and composing jobs in the 
> workflow (e.g. pushing down Mappers).
>  - This Application Master can be reused/extended by higher-level systems 
> like Pig and Hive to provide an optimized way of running their workflows.
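The "DAG of jobs" execution described in the issue summary above can be sketched as a topological run order (Kahn's algorithm). This is plain illustrative Java with no YARN dependencies; a real WFAM would launch and monitor a job where this sketch merely appends to a list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative topological scheduler for a DAG of jobs (Kahn's algorithm). */
public class DagRunner {
    /** deps maps each job to the jobs it depends on; returns a valid run order. */
    static List<String> runOrder(Map<String, List<String>> deps) {
        Map<String, Integer> remaining = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            for (String d : e.getValue()) {
                dependents.computeIfAbsent(d, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.poll();
            order.add(job); // a real WFAM would launch/monitor the job here
            for (String next : dependents.getOrDefault(job, List.of())) {
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != deps.size()) throw new IllegalStateException("cycle in DAG");
        return order;
    }
}
```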

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
