Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/LanguageManual/Archiving" page has been changed by PaulYang. http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving?action=diff&rev1=4&rev2=5

== Overview ==

Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption of the namenode. While normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are more than 50-100 million files. In such situations, it is advantageous to have as few files as possible.

The use of [[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop Archives]] is one approach to reducing the number of files in partitions. Hive has built-in support that allows users to easily move the files in existing partitions into a Hadoop Archive (HAR), so that a partition that may once have consisted of hundreds of files can occupy only ~3 files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead of indirection.
Note that archiving does NOT compress the files; HAR is analogous to the Unix tar command.
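As a sketch of what the built-in support looks like in practice, the statements below archive and unarchive a single partition. The table name `page_views` and the partition value are hypothetical placeholders, and the configuration property names should be verified against your Hive version's archiving documentation:

{{{
-- Settings generally required before archiving (assumed names; check your Hive version):
SET hive.archive.enabled=true;
SET hive.archive.har.parentdir.settable=true;
SET har.partfile.size=1099511627776;

-- Move the partition's files into a HAR (reduces namenode file count):
ALTER TABLE page_views ARCHIVE PARTITION (ds='2010-06-30');

-- Restore the original file layout, e.g. before overwriting the partition:
ALTER TABLE page_views UNARCHIVE PARTITION (ds='2010-06-30');
}}}

Once archived, the partition remains queryable through the HAR, just with the extra indirection overhead noted above.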
