[ 
https://issues.apache.org/jira/browse/HADOOP-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349431#comment-14349431
 ] 

Sean Busbey commented on HADOOP-11680:
--------------------------------------

{quote}
IMO, it might be easier to start with just making sure the sub-projects (HDFS, 
YARN, MAPRED), don't bundle the jars that are already required by common, since 
those subprojects also have common as a dependency.
{quote}

I'm not familiar with how the Hadoop assemblies work yet, but would doing this 
be as simple as having those components declare common and its dependencies 
with provided scope? When I use Maven to say a dependency should already be 
present at runtime, typical Maven assemblies skip including those artifacts in 
the bundle.

Presuming that also works for Hadoop assemblies, if the libraries still show up 
in the output of the "foo classpath" commands, is it reasonable to expect 
everything will work?

> Deduplicate jars in convenience binary distribution
> ---------------------------------------------------
>
>                 Key: HADOOP-11680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11680
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>
> Pulled from discussion on HADOOP-11656, where Colin wrote:
> {quote}
> bq. Andrew wrote: One additional note related to this, we can spend a lot of 
> time right now distributing 100s of MBs of jar dependencies when launching a 
> YARN job. Maybe this is ameliorated by the new shared distributed cache, but 
> I've heard this come up quite a bit as a complaint. If we could meaningfully 
> slim down our client, it could lead to a nice win.
> I'm frustrated that nobody responded to my earlier suggestion that we 
> de-duplicate jars. This would drastically reduce the size of our install, and 
> without rearchitecting anything.
> In fact I was so frustrated that I decided to write a program to do it myself 
> and measure the delta. Here it is:
> Before:
> {code}
> du -h /h
> 249M    /h
> {code}
> After:
> {code}
> du -h /h
> 140M    /h
> {code}
> Seems like deduplicating jars would be a much better project than splitting 
> into a client jar, if we really cared about this.
> <snip>
> {quote}
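
The program Colin mentions is not included in the quoted text, so the following 
is only a rough sketch of one way such a measurement could be made, not his 
actual tool. It assumes that "duplicate" means byte-identical jar files, that 
the install tree lives on a filesystem that supports symlinks, and that 
replacing later copies with relative symlinks is acceptable. The function names 
(file_digest, dedupe_jars) are made up for illustration.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large jars are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_jars(root: Path) -> int:
    """Replace byte-identical *.jar files under root with relative symlinks.

    Hypothetical helper: keeps the first copy found (in sorted path order)
    and returns the number of bytes reclaimed.
    """
    first_seen = {}  # digest -> path of the copy that is kept
    saved = 0
    # sorted() materializes the listing before any file is modified.
    for jar in sorted(root.rglob("*.jar")):
        if jar.is_symlink() or not jar.is_file():
            continue  # already a link, or not a regular file
        digest = file_digest(jar)
        keeper = first_seen.setdefault(digest, jar)
        if keeper is jar:
            continue  # first time this content has been seen
        saved += jar.stat().st_size
        jar.unlink()
        # Link relative to the jar's own directory so the tree stays relocatable.
        jar.symlink_to(os.path.relpath(keeper, jar.parent))
    return saved
```

Calling dedupe_jars(Path("/h")) and comparing "du -h /h" before and after would 
give a comparable delta, with the caveat that any tooling that copies or 
archives the tree has to preserve symlinks for the saving to hold.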



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
