[jira] [Updated] (MAPREDUCE-4421) Remove dependency on deployed MR jars

Jason Lowe (JIRA) Thu, 25 Jul 2013 11:08:29 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Lowe updated MAPREDUCE-4421:
----------------------------------

    Attachment: MAPREDUCE-4421.patch

Submitting a patch to try to move this forward.  We're very interested in the 
ability to patch issues in the MapReduce framework without having to bring down 
the cluster and/or push a new version to all nodes.

This patch adds a new config, {{mapreduce.application.framework.path}}, which 
defaults to being unset.  If set, it specifies a path to an archive containing 
the MR framework to use with the job.  Normally this would point to a public 
location within HDFS, and the archive would contain everything the framework 
needs to run: the MR jars, YARN client jars, HDFS client, common, and all of 
their dependencies.

This allows ops to deposit a single archive containing the MR framework into 
HDFS and configure mapred-site.xml to use it.  That framework is then lazily 
deployed to the nodes.  A new version can be uploaded to another path and 
mapred-site.xml updated; all future jobs then run with the new version while 
currently running jobs proceed with the previous version.  Alternatively, ops 
can avoid pushing the mapred-site.xml change out to all gateway/launcher boxes 
by using a standard symlink path that always points to the current version.  
New versions can be deployed, the symlink moved to them, and jobs implicitly 
pick up the new version without a corresponding mapred-site.xml change.

I've tested this by taking the entire hadoop-3.0.0-SNAPSHOT.tar.gz file and 
placing it in HDFS under /mapred/.  Admittedly, this is not the most efficient 
deployment, but it does include everything necessary.  I then set 
{{mapreduce.application.framework.path}} to 
{{/mapred/hadoop-3.0.0-SNAPSHOT.tar.gz#mr-framework}} and 
{{mapreduce.application.classpath}} to:

{noformat}
$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/lib/*
{noformat}
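
For anyone repeating this test with a different tarball, the classpath above can be generated rather than typed by hand.  The following is only a rough sketch and is not part of the patch: the class and method names are hypothetical, and it assumes that the text after the # in the framework path names the link under the container's working directory, as the $PWD/mr-framework entries above suggest.  Rejecting a path without a fragment is a choice made for this sketch, not a statement about how the patch behaves.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the patch) that rebuilds the classpath shown above. */
public class FrameworkClasspath {

    // Subdirectories of share/hadoop in the standard distribution tarball,
    // in the same order as the classpath used in the test.
    private static final String[] COMPONENTS = {"mapreduce", "common", "yarn", "hdfs"};

    /** Returns the URI fragment, i.e. the text after '#', assumed to be the link name. */
    public static String linkName(String frameworkPath) {
        String fragment = URI.create(frameworkPath).getFragment();
        if (fragment == null || fragment.isEmpty()) {
            // Sketch-only behaviour: require an explicit link name.
            throw new IllegalArgumentException("no #link-name in " + frameworkPath);
        }
        return fragment;
    }

    /** Builds "<dir>/*" and "<dir>/lib/*" entries for each component, joined with ':'. */
    public static String classpath(String linkName, String distDir) {
        List<String> entries = new ArrayList<>();
        for (String component : COMPONENTS) {
            String base = "$PWD/" + linkName + "/" + distDir + "/share/hadoop/" + component;
            entries.add(base + "/*");
            entries.add(base + "/lib/*");
        }
        return String.join(":", entries);
    }

    public static void main(String[] args) {
        String frameworkPath = "/mapred/hadoop-3.0.0-SNAPSHOT.tar.gz#mr-framework";
        System.out.println(classpath(linkName(frameworkPath), "hadoop-3.0.0-SNAPSHOT"));
    }
}
```

The second argument to classpath is the top-level directory inside the tarball, which for the standard distribution matches the tarball name without the .tar.gz suffix.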

The job then ran with my specified version of the MR framework instead of the 
one deployed to the nodes.  The application classpath is complicated because I 
used the standard distribution tarball.  I could have easily built a custom 
tarball with all the jars at the top directory and simply had a classpath of:

{noformat}
$PWD/mr-framework/*.jar
{noformat}

The framework is lazily deployed via the distributed cache, so nodes take a 
localization hit the first time they see a job with a specified framework path.  
However, subsequent jobs with the same framework run quickly.  On nodes that 
had already localized the specified framework, I saw no performance difference 
between jobs using a custom framework and jobs using the cluster-installed 
framework.

Note that there is still a dependency on deployed MR jars with respect to the 
shuffle service running on all the nodes.  With this patch, new MR versions can 
only be used when the old shuffle service on all nodes is compatible with the 
new version.  Fixing this requires the ability to specify auxiliary services 
with YARN application submissions and have those lazily deploy to nodes that 
are allocated for the application.  Ideally, those services would then be 
refcounted and retired once they are no longer needed.
                
> Remove dependency on deployed MR jars
> -------------------------------------
>
>                 Key: MAPREDUCE-4421
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4421
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: MAPREDUCE-4421.patch
>
>
> Currently MR AM depends on MR jars being deployed on all nodes via implicit 
> dependency on YARN_APPLICATION_CLASSPATH. 
> We should stop adding mapreduce jars to YARN_APPLICATION_CLASSPATH and, 
> probably, just rely on adding a shaded MR jar along with job.jar to the 
> dist-cache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
