[jira] [Commented] (HADOOP-11680) Deduplicate jars in convenience binary distribution

Allen Wittenauer (JIRA) Thu, 05 Mar 2015 12:20:57 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349416#comment-14349416
 ]

Allen Wittenauer commented on HADOOP-11680:
-------------------------------------------

De-duping jars will definitely have a launch-time impact, but I'm not sure it 
will help much at runtime, especially in trunk, where a lot of work has 
already been done to de-dupe the actual classpaths.
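
For illustration only, classpath de-duplication can be sketched as below. 
This is not the trunk implementation (the Hadoop launcher scripts are shell); 
the function name dedupe_classpath is hypothetical. It keeps the first 
occurrence of each entry, since the JVM resolves a class from the first 
classpath entry that provides it.

```python
import os


def dedupe_classpath(classpath, sep=os.pathsep):
    """Drop repeated classpath entries, keeping the first occurrence of each.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    seen = set()
    kept = []
    for entry in classpath.split(sep):
        # Skip empty segments (e.g. from "a::b") and entries already kept.
        if not entry or entry in seen:
            continue
        seen.add(entry)
        kept.append(entry)
    return sep.join(kept)
```

Note that this only compares entries as strings, so two different paths to 
the same jar would both be kept.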

It's also worth pointing out that during the discussion of HADOOP-9902 both in 
the JIRA and out of it, many felt that it was too late to make the 'hadoop 
classpath' and other commands return anything but the classpath for all of the 
sub-projects. Post-shellprofiles, it might be quite a bit of work to undo this 
assumption if we decide that, no, 'yarn classpath' shouldn't list HDFS 
requirements as well. 

IMO, it might be easier to start by just making sure the sub-projects (HDFS, 
YARN, MAPRED) don't bundle the jars that are already required by common, since 
those sub-projects also have common as a dependency.
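
A minimal sketch of how one might find those redundant jars, assuming each 
sub-project keeps its bundled dependencies in its own lib directory. The 
function names and the directory layout in the usage note are assumptions 
for illustration, not taken from the build. Jars are compared by content 
hash, so a jar counts as redundant only if an identical file exists in 
common's lib directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def redundant_jars(common_lib, subproject_libs):
    """List (jar_path, size_in_bytes) for each sub-project jar whose
    content is identical to some jar in common_lib.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    common_digests = {sha256_of(p) for p in Path(common_lib).glob("*.jar")}
    hits = []
    for lib in subproject_libs:
        for jar in sorted(Path(lib).glob("*.jar")):
            if sha256_of(jar) in common_digests:
                hits.append((jar, jar.stat().st_size))
    return hits
```

Assuming a layout such as share/hadoop/common/lib and share/hadoop/hdfs/lib 
under the install root, calling redundant_jars on those directories and 
summing the sizes gives a rough estimate of what this narrower clean-up 
would save. A jar that differs only in version would not be flagged.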

> Deduplicate jars in convenience binary distribution
> ---------------------------------------------------
>
>                 Key: HADOOP-11680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11680
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>
> Pulled from discussion on HADOOP-11656. Colin wrote:
> {quote}
> bq. Andrew wrote: One additional note related to this, we can spend a lot of 
> time right now distributing 100s of MBs of jar dependencies when launching a 
> YARN job. Maybe this is ameliorated by the new shared distributed cache, but 
> I've heard this come up quite a bit as a complaint. If we could meaningfully 
> slim down our client, it could lead to a nice win.
> I'm frustrated that nobody responded to my earlier suggestion that we 
> de-duplicate jars. This would drastically reduce the size of our install, and 
> without rearchitecting anything.
> In fact I was so frustrated that I decided to write a program to do it myself 
> and measure the delta. Here it is:
> Before:
> {code}
> du -h /h
> 249M    /h
> {code}
> After:
> {code}
> du -h /h
> 140M    /h
> {code}
> Seems like deduplicating jars would be a much better project than splitting 
> into a client jar, if we really cared about this.
> <snip>
> {quote}
