Benoit Sigoure created HADOOP-12363:
---------------------------------------
Summary: Hadoop binary distributions contain many copies of the same jars
Key: HADOOP-12363
URL: https://issues.apache.org/jira/browse/HADOOP-12363
Project: Hadoop Common
Issue Type: Improvement
Reporter: Benoit Sigoure
Priority: Minor
[I noticed this 2 years
ago|https://twitter.com/tsunanet/status/384917643162972161], but it is bugging
me again, so I'm finally filing a bug ;o
The Hadoop binary distribution is insanely redundant. Over 80% of the size of
the ~200MB tarballs distributed both by Apache upstream and by Cloudera is made
of duplicate files.
Back when I was complaining about CDH 4.4.0, the Hadoop tarball contained [3477
duplicate files, some of which had 98 copies in the
tarball|http://tsunanet.net/~tsuna/cdh440-dup-files.txt]!
Now I'm looking at the official {{hadoop-2.7.1.tar.gz}} and I'm seeing 7 copies
each of {{jackson-mapper-asl-1.9.13.jar}}, {{jersey-server-1.9.jar}},
{{protobuf-java-2.5.0.jar}}, etc.; 6 copies each of {{guava-11.0.2.jar}},
{{xz-1.0.jar}}, {{commons-logging-1.1.3.jar}}, etc.; 5 copies of
{{snappy-java-1.0.4.1.jar}}, etc.; and so on. All in all, well over 200 files
appear at least twice in the tarball, and together they account for 118MB worth
of files that could simply be replaced with symlinks (assuming you don't want to
change the structure of the tarball at all).
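For anyone who wants to reproduce this kind of count, here is a rough sketch of one way to do it on an extracted tarball. It is not the script used to produce the numbers above; the function names ({{file_digest}}, {{find_duplicates}}, {{redundant_bytes}}) are made up for illustration, and it treats two files as duplicates when their SHA-256 digests match.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map content digest -> sorted list of paths, for groups of 2+ files."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip existing symlinks and empty files: neither wastes space.
            if os.path.islink(path) or os.path.getsize(path) == 0:
                continue
            groups[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}


def redundant_bytes(dups):
    """Bytes taken by every copy beyond the first in each duplicate group."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in dups.values()
    )
```

Assuming the tarball was extracted to a directory such as {{hadoop-2.7.1}}, {{find_duplicates("hadoop-2.7.1")}} gives the groups of identical files, and passing that result to {{redundant_bytes}} gives the space that the extra copies take up.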
This is really not necessary :)
Can we fix the distribution? I'm sure Cloudera and others will fix their
distributions as well once this is fixed upstream (their distros exhibit a
substantially more acute version of this problem).
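To make the symlink idea concrete, here is a minimal sketch of what a post-processing step over the staged distribution directory could look like. This is only an assumption about one possible fix, not how the Hadoop build works today: {{symlink_duplicates}} is a hypothetical helper, it keeps the first copy it encounters and replaces later identical copies with relative symlinks, and it ignores platforms where symlinks are unavailable or restricted.

```python
import hashlib
import os


def sha256_of(path):
    """SHA-256 hex digest of a file (read whole; fine for jar-sized files)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def symlink_duplicates(root):
    """Replace later identical copies under root with relative symlinks.

    Returns the number of bytes freed. Hypothetical helper for illustration.
    """
    first_seen = {}  # digest -> path of the copy that is kept
    saved = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size == 0:
                continue
            keeper = first_seen.setdefault(sha256_of(path), path)
            if keeper == path:
                continue  # first copy of this content: keep it as a real file
            # A relative target keeps the link valid wherever the tree is unpacked.
            target = os.path.relpath(keeper, start=dirpath)
            os.remove(path)
            os.symlink(target, path)
            saved += size
    return saved
```

Because the directory layout is unchanged, anything that looks up a jar by its current path should still find it, provided the tool that unpacks the tarball preserves symlinks.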