[
https://issues.apache.org/jira/browse/MAPREDUCE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856960#action_12856960
]
Scott Carey commented on MAPREDUCE-1700:
----------------------------------------
I'm glad you filed this. I've been getting frustrated with this issue myself
over the last couple of weeks and have various thoughts on it. Some of these
ideas are raw and flawed, but here is what I have been thinking:
Ideally, the framework would limit the classes visible to a job to the minimum
required for job execution. A job could then bring in its own dependencies.
Also, if there were a built-in Hadoop dependency hidden by default that a job
wanted, it could request access to it.
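The visibility rule above could be approximated with a child-first classloader that resolves the job's own jars before delegating to the framework. This is only an illustrative sketch, not an actual Hadoop API; the class name and the filtered package prefixes are assumptions:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: a child-first classloader that looks in the job's own
// jars before delegating to the framework's classloader. Not a real Hadoop
// class; the filtered prefixes are illustrative.
public class JobFirstClassLoader extends URLClassLoader {

    public JobFirstClassLoader(URL[] jobJars, ClassLoader parent) {
        super(jobJars, parent);
    }

    @Override
    protected synchronized Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        // JDK and core framework classes must always come from the parent,
        // otherwise the job cannot exchange framework types with the system.
        if (name.startsWith("java.") || name.startsWith("org.apache.hadoop.")) {
            return super.loadClass(name, resolve);
        }
        Class<?> c = findLoadedClass(name);
        if (c == null) {
            try {
                c = findClass(name);                 // job's jars first
            } catch (ClassNotFoundException e) {
                c = super.loadClass(name, resolve);  // fall back to parent
            }
        }
        if (resolve) {
            resolveClass(c);
        }
        return c;
    }
}
```

A job could then ship its own Avro (or any other library) version without colliding with the framework's copy, since only the whitelisted prefixes delegate parent-first.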
Similarly frustrating, and related, is that an M/R job has to submit its whole
job jar to the cluster each time. I have a 28MB jar, and a workflow of about 35
dependent M/R jobs (a DAG of them). Towards the end of this chain, the jobs
get smaller and smaller in data size (the final ones are joining, augmenting,
transforming, and sorting data aggregated by the earlier jobs).
Two big things account for more clock time than the 'heavy lifting' work of the
initial 'big data' jobs -- job submission time and scheduling inefficiencies.
The former is related to dependency management.
If the framework could support installing jars into an 'application'
classloader space and then jobs reference that space, task latency could be
reduced significantly as each job submission would not need to also submit all
its dependency jars. In my case, the job jar would probably become a couple
hundred K instead of almost 30MB -- or even zero K if the jobs could just be
stored and called. TaskTracker nodes could cache these application library
spaces to reduce job start-up time.
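The TaskTracker-side cache described above could be as simple as a map from an application id to a reusable classloader over the installed library space. Again a hypothetical sketch; none of these names exist in Hadoop:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a per-node cache: one classloader per installed
// 'application' library space, reused across jobs so a submission only needs
// to ship the thin job jar. All names are illustrative.
public class AppClassLoaderCache {
    private final Map<String, URLClassLoader> loaders = new ConcurrentHashMap<>();

    /** Returns the cached loader for an application, creating it on first use. */
    public URLClassLoader getOrCreate(String appId, URL[] installedJars) {
        return loaders.computeIfAbsent(appId,
                id -> new URLClassLoader(installedJars,
                        AppClassLoaderCache.class.getClassLoader()));
    }
}
```

Repeated submissions for the same application would then hit the cached loader instead of re-localizing 30MB of dependency jars per job.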
In some ways, the dependency management above is like an application server.
Each 'application' has its own classloader space, and there might be several
different jobs available in an 'application' -- analogous to several servlets
available in a web app. Like an app server, there will probably be a need
for a lib directory that is global, one that is exclusive to the framework, and
a per-application space.
Such classloader partitioning raises some questions about static variables.
With JVMs shared across tasks, users expect statics to live from one task to
the next within the same job. This means the classloader in a JVM corresponds
to the Job ID and to whether it is a map or a reduce. In the distant future,
per-Job classloaders could enable JVM recycling across jobs, because disposing
of a Job's classloader frees its static variables. That in turn opens up
further reductions in start-up time and per-task costs.
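The JVM-recycling idea sketched above hinges on the disposal step: once nothing references the job's classloader, the classes it defined (and their statics) become collectable. A minimal, hypothetical illustration of that lifecycle, assuming the job runs with the loader as its context classloader:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: run a job under its own classloader, then dispose of
// the loader so the job's statics become garbage-collectable and the JVM can
// be handed to the next job. Not an actual Hadoop API.
public class JobLoaderLifecycle {

    public static void runAndDispose(URL[] jobJars, Runnable job) throws IOException {
        URLClassLoader jobLoader = new URLClassLoader(jobJars,
                JobLoaderLifecycle.class.getClassLoader());
        ClassLoader previous = Thread.currentThread().getContextClassLoader();
        try {
            Thread.currentThread().setContextClassLoader(jobLoader);
            job.run();  // all tasks of this job share jobLoader, so statics persist across them
        } finally {
            Thread.currentThread().setContextClassLoader(previous);
            jobLoader.close();  // release jar handles; job-loaded classes now unreachable
        }
    }
}
```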
> User supplied dependencies may conflict with MapReduce system JARs
> ------------------------------------------------------------------
>
> Key: MAPREDUCE-1700
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1700
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Reporter: Tom White
>
> If user code has a dependency on a version of a JAR that is different to the
> one that happens to be used by Hadoop, then it may not work correctly. This
> happened with user code using a different version of Avro, as reported
> [here|https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852081#action_12852081].
> The problem is analogous to the one that application servers have with WAR
> loading. Using a specialized classloader in the Child JVM is probably the way
> to solve this.