Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/LanguageManual/Archiving" page has been changed by PaulYang. http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving?action=diff&rev1=4&rev2=5

== Overview ==

Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption of the namenode. While normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are more than 50-100 million files. In such situations, it is advantageous to have as few files as possible.

The use of [[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop Archives]] is one approach to reducing the number of files in partitions. Hive has built-in support that allows users to easily move the files in existing partitions into a Hadoop Archive (HAR), so that a partition that may once have consisted of hundreds of files can occupy only ~3 files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead of indirection.
Note that archiving does NOT compress the files; HAR is analogous to the Unix tar command.
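As a sketch of what the built-in support looks like in practice, the statements below archive and unarchive a single partition. The table name `page_views` and the partition value are hypothetical placeholders, and the configuration property names should be verified against your Hive version's archiving documentation:

{{{
-- Settings generally required before archiving (assumed names; check your Hive version):
SET hive.archive.enabled=true;
SET hive.archive.har.parentdir.settable=true;
SET har.partfile.size=1099511627776;

-- Move the partition's files into a HAR (reduces namenode file count):
ALTER TABLE page_views ARCHIVE PARTITION (ds='2010-06-30');

-- Restore the original file layout, e.g. before overwriting the partition:
ALTER TABLE page_views UNARCHIVE PARTITION (ds='2010-06-30');
}}}

Once archived, the partition remains queryable through the HAR, just with the extra indirection overhead noted above.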
