[jira] [Resolved] (HADOOP-12363) Hadoop binary distributions contain many copies of the same jars

Allen Wittenauer (JIRA) Sat, 29 Aug 2015 09:45:06 -0700

     [ https://issues.apache.org/jira/browse/HADOOP-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Allen Wittenauer resolved HADOOP-12363.
---------------------------------------
       Resolution: Duplicate
    Fix Version/s: HADOOP-10115

Closing as a dupe.

> Hadoop binary distributions contain many copies of the same jars
> ----------------------------------------------------------------
>
>                 Key: HADOOP-12363
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12363
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Benoit Sigoure
>            Priority: Minor
>             Fix For: HADOOP-10115
>
>
> [I noticed this 2 years 
> ago|https://twitter.com/tsunanet/status/384917643162972161] but this is 
> bugging me again so I'm finally filing a bug ;o
> The Hadoop binary distribution is insanely redundant.  Over 80% of the size 
> of the ~200MB tarballs distributed both by Apache upstream and by Cloudera is 
> made of duplicate files.
> Back when I was complaining about CDH 4.4.0, the Hadoop tarball contained 
> [3477 duplicate files, some of which had 98 copies in the 
> tarball|http://tsunanet.net/~tsuna/cdh440-dup-files.txt]!
> Now I'm looking at the official {{hadoop-2.7.1.tar.gz}} and I'm seeing 7 
> copies of {{jackson-mapper-asl-1.9.13.jar}}, {{jersey-server-1.9.jar}}, 
> {{protobuf-java-2.5.0.jar}}, etc.; 6 copies of {{guava-11.0.2.jar}}, 
> {{xz-1.0.jar}}, {{commons-logging-1.1.3.jar}}, etc.; 5 copies of 
> {{snappy-java-1.0.4.1.jar}}; and so on.  All in all there are well over 200 
> files that appear at least twice in the tarball, and they account for 118MB 
> worth of files that could just be replaced with symlinks (assuming you don't 
> want to change the structure of the tarball at all).
> This is really not necessary :)
> Can we fix the distribution?  I'm sure Cloudera and others will fix their 
> distributions as well once this is fixed upstream (their distros exhibit a 
> substantially more acute version of this problem).
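
For anyone who wants to reproduce the numbers in the description, here is a minimal sketch of one way to do it. It is not the reporter's actual method and not what HADOOP-10115 implements. It assumes the tarball has already been extracted to a local directory, treats files as duplicates when their SHA-256 digests match, and skips empty files. The function names (find_duplicates, redundant_bytes, symlink_duplicates) are hypothetical and chosen only for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map digest -> sorted paths, keeping only groups of 2+ regular files."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks (already deduplicated) and anything that is
            # not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Assumption: empty files are ignored so that unrelated
            # zero-byte marker files are never linked together.
            if os.path.getsize(path) == 0:
                continue
            by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def redundant_bytes(groups):
    """Bytes that would be saved by keeping one copy per group."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in groups.values()
    )


def symlink_duplicates(groups):
    """Replace every copy except the first with a relative symlink."""
    for paths in groups.values():
        keep = paths[0]
        for dup in paths[1:]:
            # A relative target keeps the links valid if the extracted
            # tree is moved or re-archived.
            target = os.path.relpath(keep, start=os.path.dirname(dup))
            os.remove(dup)
            os.symlink(target, dup)
```

Calling find_duplicates on the extracted directory and passing the result to redundant_bytes gives a figure comparable to the "118MB" estimate above. symlink_duplicates modifies the tree in place, so it is best tried on a scratch copy; whether every consumer of the layout tolerates symlinks is a separate question this sketch does not address.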



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
