Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hive/LanguageManual/Archiving" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving?action=diff&rev1=5&rev2=6

--------------------------------------------------

  
  Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption in the namenode. While normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are more than 50-100 million files. In such situations, it is advantageous to have as few files as possible.
  
- The use of 
[[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop 
Archives]] is one approach to reducing the number of files in partitions. Hive 
has built-in support that allows users to easily move files in existing 
partitions to a Hadoop Archive (HAR) so that a partition that may once have 
consisted of 100's of files occupy ~3 files (depending on settings) However, 
the trade off is that queries may be slower due to the additional overhead in 
indirection.
+ The use of [[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop Archives]] is one approach to reducing the number of files in partitions. Hive has built-in support to convert files in existing partitions to a Hadoop Archive (HAR), so that a partition that may once have consisted of hundreds of files can occupy just ~3 files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead in reading from the HAR.
  
  Note that archiving does NOT compress the files - HAR is analogous to the Unix tar command.
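
For reference, archiving and unarchiving a partition are driven by ALTER TABLE statements after enabling a few settings. This is a minimal sketch; the table name `srcpart`, the partition spec, and the setting values are illustrative and depend on the cluster:

```sql
-- Settings that must be in place before archiving (values are examples).
SET hive.archive.enabled=true;
SET hive.archive.har.parentdir.settable=true;
SET har.partfile.size=1099511627776;

-- Move the partition's files into a Hadoop Archive (HAR).
ALTER TABLE srcpart ARCHIVE PARTITION (ds='2008-04-08', hr='12');

-- Restore the original files if the overhead of reading from the HAR
-- turns out to be too costly for queries on this partition.
ALTER TABLE srcpart UNARCHIVE PARTITION (ds='2008-04-08', hr='12');
```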
  
