[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856960#action_12856960
 ] 

Scott Carey commented on MAPREDUCE-1700:
----------------------------------------

I'm glad you filed this.  I've been getting frustrated with this issue myself 
over the last couple of weeks and have various thoughts on it.  Some of these 
ideas are raw and flawed, but here is what I have been thinking:

Ideally, the framework would limit the classes visible to a job to the minimum 
required for job execution.  A job could then bring in its own dependencies.  
Also, if there were a built-in Hadoop dependency that is hidden by default but 
that a job wanted, it could request access to it.
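To make the idea concrete, here is a minimal "child-first" classloader sketch -- not an existing Hadoop class, just an illustration of how a job's own jars could shadow the framework's versions, with the framework classpath only consulted as a fallback:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: prefer the job's own jars, delegating to the parent
// (the framework classpath) only when the class is not in the job's jars.
public class JobClassLoader extends URLClassLoader {
    public JobClassLoader(URL[] jobJars, ClassLoader frameworkLoader) {
        super(jobJars, frameworkLoader);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Look in the job's own jars first.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Fall back to the framework/parent classpath; this also
                    // covers java.* classes, which only the bootstrap loader
                    // may define.
                    c = super.loadClass(name, resolve);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

The inversion of the usual parent-first delegation is what lets a job carry, say, its own Avro version without colliding with the framework's copy.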

Similarly frustrating, and related, is how an M/R job has to submit its whole 
job jar to the cluster each time.  I have a 28MB jar and a workflow of about 
35 dependent M/R jobs (a DAG of them).  Towards the end of this chain, the 
jobs get smaller and smaller in data size (the final ones join, augment, 
transform, and sort data aggregated by the earlier jobs).  
Two things account for more wall-clock time than the 'heavy lifting' of the 
initial 'big data' jobs -- job submission time and scheduling inefficiencies.  
The former is related to dependency management.

If the framework could support installing jars into an 'application' 
classloader space that jobs then reference, task latency could be reduced 
significantly, since each job submission would not need to resubmit all of 
its dependency jars.  In my case, the job jar would probably shrink to a 
couple hundred K instead of almost 30MB -- or even to zero if the jobs 
themselves could be stored and invoked.  TaskTracker nodes could cache these 
application library spaces to reduce job start-up time.
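A sketch of what that TaskTracker-side cache might look like -- again hypothetical, with illustrative names rather than real Hadoop APIs.  Jars for an 'application' are installed once, and every job referencing that application reuses the same loader instead of shipping dependencies on every submission:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one classloader per installed 'application',
// shared by all jobs that reference that application.
public class AppClassLoaderCache {
    private final ConcurrentHashMap<String, URLClassLoader> cache =
            new ConcurrentHashMap<>();

    // Returns the cached loader for appId, creating it on first use
    // from the jars previously installed for that application.
    public URLClassLoader loaderFor(String appId, URL[] installedJars) {
        return cache.computeIfAbsent(appId,
                id -> new URLClassLoader(installedJars, getClass().getClassLoader()));
    }

    // Uninstalling drops the loader so its classes (and their statics)
    // become eligible for garbage collection.
    public void uninstall(String appId) {
        URLClassLoader removed = cache.remove(appId);
        if (removed != null) {
            try {
                removed.close();
            } catch (IOException ignored) {
                // jar file handles already gone; nothing left to release
            }
        }
    }
}
```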

In some ways, the dependency management above is like an application server.  
Each 'application' has its own classloader space, and there might be several 
different jobs available in an 'application' -- analogous to several servlets 
available in a web app.  Like an app server, there will probably be a need 
for a lib space that is global, one that is exclusive to the framework, and 
one that is per-application.


Such classloader partitioning raises some questions about static variables.  
With JVMs shared across tasks, users expect statics to persist from one task 
to the next within the same job.  This means the classloader in a JVM should 
correspond to the job ID plus whether the task is a map or a reduce.  Per-job 
classloaders could also enable JVM recycling across jobs in the distant 
future, because disposing of a job's classloader frees its static variables.  
That in turn opens the possibility of future reductions in start-up time and 
per-task costs.
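The keying described above could be sketched like this (hypothetical names, not a real Hadoop class): in a shared or recycled JVM, the loader is chosen by (job ID, map-or-reduce), so statics survive across tasks of the same job but are discarded with the job's classloader when the job finishes:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: classloaders partitioned by job ID and by
// map-vs-reduce side, so statics are scoped to one side of one job.
public class JvmLoaderRegistry {

    static final class JobLoaderKey {
        final String jobId;
        final boolean isMap; // true for map tasks, false for reduce tasks

        JobLoaderKey(String jobId, boolean isMap) {
            this.jobId = jobId;
            this.isMap = isMap;
        }

        @Override public boolean equals(Object o) {
            return o instanceof JobLoaderKey
                    && ((JobLoaderKey) o).jobId.equals(jobId)
                    && ((JobLoaderKey) o).isMap == isMap;
        }

        @Override public int hashCode() {
            return Objects.hash(jobId, isMap);
        }
    }

    private final ConcurrentHashMap<JobLoaderKey, ClassLoader> loaders =
            new ConcurrentHashMap<>();

    // Same job + same side => same loader, so statics persist across tasks.
    public ClassLoader loaderFor(String jobId, boolean isMap) {
        return loaders.computeIfAbsent(new JobLoaderKey(jobId, isMap),
                k -> new URLClassLoader(new URL[0], getClass().getClassLoader()));
    }

    // Dropping a finished job's loaders frees its statics, allowing the JVM
    // to be recycled for the next job.
    public void jobFinished(String jobId) {
        loaders.keySet().removeIf(k -> k.jobId.equals(jobId));
    }
}
```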

> User supplied dependencies may conflict with MapReduce system JARs
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1700
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1700
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>            Reporter: Tom White
>
> If user code has a dependency on a version of a JAR that is different to the 
> one that happens to be used by Hadoop, then it may not work correctly. This 
> happened with user code using a different version of Avro, as reported 
> [here|https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852081#action_12852081].
> The problem is analogous to the one that application servers have with WAR 
> loading. Using a specialized classloader in the Child JVM is probably the way 
> to solve this.

-- 
This message is automatically generated by JIRA.
