[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765889#comment-13765889
 ] 

Hitesh Shah commented on MAPREDUCE-4421:
----------------------------------------

[~jlowe] Had a few questions/comments related to the implementation/patch: 

- Why does the classpath need to include all of the common, hdfs and yarn jar 
locations? Assuming that MR is running on a YARN-based cluster, shouldn't the 
location of the core dependencies come from the cluster deployment, i.e. via the 
env that the NM sets for a container? I believe the only jars that MR should 
have in its uploaded tarball should be the client jars. I understand that there 
is no clear boundary for client-side-only jars for common and hdfs today (for 
YARN, I believe it should be simple to split out the client-side 
requirements), but it is something we should aim for, or else assume that the 
jars deployed on the cluster are compatible. 
  - I guess the underlying question is why use the full hadoop tarball and not 
just the mapreduce-only tarball? If MR is truly a user-land library, it should 
be treated as such and have a separate deployment approach.

- I would vote to make the tarball in HDFS be the only way to run MR on YARN. 
Obviously, this cannot be done for 2.x but we should move to this model on 
trunk and not support the current approach at all there. Comments? 

- The other point is related to configs. Configuration still loads mapred-site 
and mapred-default files and new Configuration objects are created on the 
cluster. Are these files still expected on the cluster? job.xml does override 
these but cluster configs could still have final params. If this is meant to be 
addressed in a follow-up jira to ensure all MR configs come from the client, 
you can ignore this point for now.

- How do you see the framework name extracted from the path being used? Is it 
just a safety check to ensure that it is found in the classpath? Will it have 
any relation to a version? A minor nit: "framework name" seems confusing given 
the earlier use of framework name, i.e. yarn vs. local framework. 

- The description in the default-xml for mapreduce.application.framework.path 
does not mention the need for the URI fragment or how the fragment is used as a 
sanity check against the classpath. 

- Regarding versions, it seems like users will need to do 2 things: change the 
location of the tarball on HDFS and modify the classpath. Users will need to 
know the exact structure of the classpath. In such a scenario, do defaults even 
make sense? On the other hand, if we define a common standard, i.e. a base path 
for all MR tarballs, with each tarball in a defined structure (possibly with 
version info added later on for the code to infer the structure of the 
tarball), all the user would need to do is specify the base path (which could 
have a default value) and a version, which again has a default value. The 
latter approach would require the code to construct the necessary classpath if 
the upload path is in use. Do you have any comments on which of the 2 
approaches makes more sense? The former is way more flexible but a bit more 
complex. The latter is brittle/inflexible with respect to changing tarball 
structures but likely easier to enforce a standard on.
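
To make the second approach concrete, here is a rough sketch of how the code 
might derive both the tarball URI and the classpath from a base path and a 
version. This is only an illustration, not what the patch does: the tarball 
name, the <base>/<version>/ layout, the "$PWD/<alias>/..." classpath form and 
every method name below are assumptions made up for the example. It uses only 
java.net.URI, so it runs without any Hadoop jars.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Sketch only: all names and the directory layout are hypothetical.
class FrameworkPathSketch {
    // Assumed standard tarball name under <basePath>/<version>/.
    static final String TARBALL_NAME = "mr-framework.tar.gz";

    // Build the framework URI; the fragment is the alias the tarball is
    // expected to be unpacked under in the container working directory.
    static URI frameworkUri(String basePath, String version, String alias) {
        return URI.create(basePath + "/" + version + "/" + TARBALL_NAME + "#" + alias);
    }

    // Framework name: the URI fragment if present, else the last path component.
    static String frameworkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Construct the classpath from the alias and a fixed list of directories
    // inside the tarball (the "defined structure" the comment refers to).
    static String buildClasspath(String alias, List<String> relativeDirs) {
        List<String> entries = new ArrayList<>();
        for (String dir : relativeDirs) {
            entries.add("$PWD/" + alias + "/" + dir + "/*");
        }
        return String.join(":", entries);
    }

    // Sanity check: does some classpath entry refer to the framework name?
    static boolean classpathMentions(String classpath, String name) {
        for (String entry : classpath.split(":")) {
            if (entry.contains("/" + name + "/")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        URI uri = frameworkUri("hdfs:///apps/mapreduce", "2.1.0", "mrframework");
        String name = frameworkName(uri);
        String classpath = buildClasspath(name,
                List.of("share/hadoop/mapreduce", "share/hadoop/mapreduce/lib"));
        System.out.println(uri);
        System.out.println(classpath);
        System.out.println(classpathMentions(classpath, name));
    }
}
```

With something like this, a user would only set the base path and the version. 
The trade-off is the one noted above: the directory list is baked into the 
code, so a tarball with a different internal structure would break it.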

                
> Remove dependency on deployed MR jars
> -------------------------------------
>
>                 Key: MAPREDUCE-4421
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4421
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-4421.patch, MAPREDUCE-4421.patch
>
>
> Currently MR AM depends on MR jars being deployed on all nodes via implicit 
> dependency on YARN_APPLICATION_CLASSPATH. 
> We should stop adding mapreduce jars to YARN_APPLICATION_CLASSPATH and, 
> probably, just rely on adding a shaded MR jar along with job.jar to the 
> dist-cache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
