[
https://issues.apache.org/jira/browse/HADOOP-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer resolved HADOOP-12363.
---------------------------------------
Resolution: Duplicate
Fix Version/s: HADOOP-10115
Closing as a dupe.
> Hadoop binary distributions contain many copies of the same jars
> ----------------------------------------------------------------
>
> Key: HADOOP-12363
> URL: https://issues.apache.org/jira/browse/HADOOP-12363
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Benoit Sigoure
> Priority: Minor
> Fix For: HADOOP-10115
>
>
> [I noticed this 2 years
> ago|https://twitter.com/tsunanet/status/384917643162972161] but this is
> bugging me again so I'm finally filing a bug ;o
> The Hadoop binary distribution is insanely redundant. Over 80% of the size
> of the ~200MB tarballs distributed both by Apache upstream and by Cloudera is
> made of duplicate files.
> Back when I was complaining about CDH 4.4.0, the Hadoop tarball contained
> [3477 duplicate files, some of which had 98 copies in the
> tarball|http://tsunanet.net/~tsuna/cdh440-dup-files.txt]!
> Now I'm looking at the official {{hadoop-2.7.1.tar.gz}} and I'm seeing 7
> copies of {{jackson-mapper-asl-1.9.13.jar}}, {{jersey-server-1.9.jar}},
> {{protobuf-java-2.5.0.jar}}, etc.; 6 copies of {{guava-11.0.2.jar}},
> {{xz-1.0.jar}}, {{commons-logging-1.1.3.jar}}, etc.; 5 copies of
> {{snappy-java-1.0.4.1.jar}}, and so on. All in all, well over 200 files
> appear at least twice in the tarball, and they account for 118MB of files
> that could simply be replaced with symlinks (assuming you don't want to
> change the structure of the tarball at all).
> This is really not necessary :)
> Can we fix the distribution? I'm sure Cloudera and others will fix their
> distributions as well once this is fixed upstream (their distros exhibit a
> substantially more acute version of this problem).
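
For anyone who wants to reproduce the duplicate counts and the size figure
quoted above, here is a minimal sketch. It assumes "duplicate" means
byte-identical regular files (same SHA-256 digest), and that the saving is
what you would get by keeping one copy per group and turning the rest into
symlinks. The function names find_duplicates, reclaimable_bytes and report
are made up for illustration; they are not part of Hadoop or its build.

```python
import hashlib
import tarfile
from collections import defaultdict


def find_duplicates(tar_path):
    """Group regular files in a tarball by the SHA-256 of their contents.

    Returns {digest: [(name, size), ...]} for digests seen more than once.
    """
    groups = defaultdict(list)
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile() or member.size == 0:
                continue  # skip directories, links and empty files
            digest = hashlib.sha256()
            with tar.extractfile(member) as fh:
                # Hash in 1 MiB chunks so large jars are not held in memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            groups[digest.hexdigest()].append((member.name, member.size))
    return {h: files for h, files in groups.items() if len(files) > 1}


def reclaimable_bytes(duplicates):
    """Bytes saved if all but one copy in each group became a symlink."""
    return sum(files[0][1] * (len(files) - 1) for files in duplicates.values())


def report(tar_path):
    """Print each duplicate group, most copies first, then the total saving."""
    duplicates = find_duplicates(tar_path)
    for files in sorted(duplicates.values(), key=len, reverse=True):
        print(f"{len(files)} copies, {files[0][1]} bytes each:")
        for name, _ in files:
            print(f"    {name}")
    print(f"{len(duplicates)} duplicated files, "
          f"{reclaimable_bytes(duplicates)} bytes reclaimable")
```

Calling report("hadoop-2.7.1.tar.gz") on a locally downloaded tarball would
list the groups. The exact numbers depend on the tarball and on how
duplicates are defined, so they may not match the figures in the report
exactly.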
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)