[
https://issues.apache.org/jira/browse/HADOOP-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105364#comment-14105364
]
Alejandro Abdelnur commented on HADOOP-10986:
---------------------------------------------
The significant size increase seems to come from the documentation,
specifically the protobuf javadocs:
{code}
$ cd hadoop-2.5.0/share/doc/hadoop
$ du -m -s *
55 api
119 common
1 css
1 dependency-analysis.html
1 hadoop-annotations
1 hadoop-archives
1 hadoop-assemblies
2 hadoop-auth
1 hadoop-auth-examples
1 hadoop-common-project
1 hadoop-datajoin
1 hadoop-dist
1 hadoop-distcp
1 hadoop-extras
1 hadoop-gridmix
1 hadoop-hdfs-bkjournal
11 hadoop-hdfs-httpfs
1 hadoop-hdfs-nfs
1 hadoop-hdfs-project
1 hadoop-mapreduce
3 hadoop-mapreduce-client
1 hadoop-mapreduce-examples
1 hadoop-maven-plugins
1 hadoop-minicluster
1 hadoop-minikdc
1 hadoop-nfs
1 hadoop-openstack
1 hadoop-pipes
725 hadoop-project-dist
1 hadoop-rumen
1 hadoop-sls
1 hadoop-streaming
1 hadoop-tools
5 hadoop-yarn
1 hadoop-yarn-project
618 hdfs
1 httpfs
1 images
1 index.html
1 mapreduce
1 project-reports.html
1 yarn
{code}
{code}
$ cd hadoop-2.5.0/share/doc/hadoop/
$ du -m -s hdfs/api/src-html/org/apache/hadoop/hdfs/server/namenode/
222 hdfs/api/src-html/org/apache/hadoop/hdfs/server/namenode/
{code}
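To rank the doc directories by size instead of scanning the listing by eye, the same du call can be piped through sort. This is only a minimal sketch; the function name largest_entries is made up for illustration and is not part of Hadoop.

```shell
# largest_entries DIR [N]: print the N largest entries directly under DIR,
# sizes in megabytes, largest first. (Hypothetical helper, not part of Hadoop.)
largest_entries() {
    dir=${1:-.}
    count=${2:-5}
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$dir" || exit 1
        # du -m -s prints one "<megabytes> <name>" line per entry;
        # sort -rn orders those lines by the leading number, largest first.
        du -m -s -- * | sort -rn | head -n "$count"
    )
}
```

Given the sizes listed above, calling largest_entries hadoop-2.5.0/share/doc/hadoop 3 should put hadoop-project-dist, hdfs and common at the top.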
It also seems we have duplicate javadoc directories:
{code}
$ cd hadoop-2.5.0/share/doc/hadoop/
$ find . -name api -type d
./api
./api/org/apache/hadoop/mapreduce/v2/api
./api/org/apache/hadoop/yarn/api
./api/org/apache/hadoop/yarn/client/api
./api/src-html/org/apache/hadoop/yarn/api
./api/src-html/org/apache/hadoop/yarn/client/api
./common/api
./hadoop-project-dist/hadoop-common/api
./hadoop-project-dist/hadoop-hdfs/api
./hdfs/api
{code}
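The find output only shows that several api directories exist. Whether any two of them really hold the same content could be checked with a recursive diff. A rough sketch, where same_tree is a hypothetical helper name:

```shell
# same_tree A B: succeed (exit status 0) if directories A and B contain
# identical files. (Hypothetical helper for illustration.)
same_tree() {
    # -r recurses into subdirectories; -q only reports whether files differ.
    # Output is discarded; the exit status alone carries the answer.
    diff -rq "$1" "$2" >/dev/null 2>&1
}
```

Run from share/doc/hadoop, same_tree common/api hadoop-project-dist/hadoop-common/api && echo duplicate would print "duplicate" only if the two trees match. The path names suggest hdfs/api and hadoop-project-dist/hadoop-hdfs/api are a second pair worth comparing, though that pairing is an assumption based on the names alone.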
> hadoop tarball is twice as big as prev. version and 6 times as big unpacked
> ---------------------------------------------------------------------------
>
> Key: HADOOP-10986
> URL: https://issues.apache.org/jira/browse/HADOOP-10986
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.5.0
> Reporter: André Kelpe
> Assignee: Karthik Kambatla
> Priority: Blocker
>
> I noticed that the binary tarball for 2.5.0 is almost 300MB, while 2.4.1 is
> only 132MB. Unpacking the latest tarball gives me 1.8 GB of stuff, with the
> majority in the "share" directory.
>
> {code}
> $ cd hadoop-2.4.1
> $ du -sh *
> 364K bin
> 356K etc
> 100K include
> 2,3M lib
> 128K libexec
> 24K LICENSE.txt
> 12K NOTICE.txt
> 12K README.txt
> 336K sbin
> 280M share
> {code}
> {code}
> $ cd hadoop-2.5.0
> $ du -sh *
> 512K bin
> 332K etc
> 100K include
> 4,6M lib
> 128K libexec
> 336K sbin
> 1,8G share
> {code}
> I also saw some warnings from tar while unpacking:
> {code}
> $ tar xf hadoop-2.5.0.tar.gz
> tar: Ignoring unknown extended header keyword `SCHILY.dev'
> tar: Ignoring unknown extended header keyword `SCHILY.ino'
> tar: Ignoring unknown extended header keyword `SCHILY.nlink'
> tar: Ignoring unknown extended header keyword `SCHILY.dev'
> tar: Ignoring unknown extended header keyword `SCHILY.ino'
> tar: Ignoring unknown extended header keyword `SCHILY.nlink'
> tar: Ignoring unknown extended header keyword `SCHILY.dev'
> tar: Ignoring unknown extended header keyword `SCHILY.ino'
> tar: Ignoring unknown extended header keyword `SCHILY.nlink'
> tar: Ignoring unknown extended header keyword `SCHILY.dev'
> tar: Ignoring unknown extended header keyword `SCHILY.ino'
> tar: Ignoring unknown extended header keyword `SCHILY.nlink'
> tar: Ignoring unknown extended header keyword `SCHILY.dev'
> tar: Ignoring unknown extended header keyword `SCHILY.ino'
> tar: Ignoring unknown extended header keyword `SCHILY.nlink'
> tar: Ignoring unknown extended header keyword `SCHILY.dev'
> tar: Ignoring unknown extended header keyword `SCHILY.ino'
> tar: Ignoring unknown extended header keyword `SCHILY.nlink'
> {code}