Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hive/LanguageManual/Archiving" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving?action=diff&rev1=2&rev2=3

--------------------------------------------------

  
  == Overview ==
  
- Due to the design of HDFS, the number of files on HDFS directly affect the 
memory consumption in the namenode. While normally not a problem for small 
clusters, memory usage may hit the limits of accessible memory on a single 
machine when there are >50-100 million files. Consequently, it is advantageous 
to have as few files as possible.
+ Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption in the namenode. While normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are >50-100 million files. Consequently, it is advantageous to have as few files as possible.
  
  The use of [[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop Archives]] is one approach to reducing the number of files in a partition. Hive has built-in support that allows users to easily move the files in existing partitions to a Hadoop Archive (HAR) file, so that a partition that may once have consisted of hundreds of files can occupy just ~3 files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead of indirection.
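  For reference, outside of Hive a HAR can be created directly with the Hadoop archive tool; the paths below are illustrative, not part of this page's examples:
  {{{
hadoop archive -archiveName data.har -p /warehouse/table/ds=1 /user/hive/out
  }}}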
  
@@ -26, +26 @@

  
  {{{hive.archive.enabled}}} controls whether archiving operations are enabled.
  
- {{{hive.archive.har.parentdir.settable}}} informs Hive whether the parent 
directory is set-able while creating the archive. In the latest version of 
Hadoop the {{{-p}}} option could be set to specify the root directory of the 
archive. For example, if {{{/dir1/dir2/file}} were archived with {{{/dir1}}} as 
the parent directory, then the resulting archive file will contain the 
directory structure {{{dir2/file}}}. In older versions of Hadoop, this option 
was not available and therefore Hive must be configured to accommodate this 
limitation. 
+ {{{hive.archive.har.parentdir.settable}}} informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop, the {{{-p}}} option can be used to specify the root directory of the archive. For example, if {{{/dir1/dir2/file}}} were archived with {{{/dir1}}} as the parent directory, the resulting archive file would contain the directory structure {{{dir2/file}}}. In older versions of Hadoop, this option was not available, so Hive must be configured to accommodate this limitation.
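  Taken together, a session that prepares these settings might look like the following (the values shown are illustrative):
  {{{
SET hive.archive.enabled=true;
SET hive.archive.har.parentdir.settable=true;
SET har.partfile.size=1099511627776;
  }}}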
  
  {{{har.partfile.size}}} controls the size of the files that make up the archive. The archive will contain {{{[Size of partition]/har.partfile.size}}} files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
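  As a back-of-the-envelope check of the formula above, this hypothetical snippet (names and sizes are illustrative, not Hive APIs) computes the expected part-file count:

```python
import math

def expected_archive_files(partition_size_bytes, partfile_size_bytes):
    """Approximate number of part files in the resulting HAR:
    size of the partition divided by har.partfile.size, rounded up."""
    return math.ceil(partition_size_bytes / partfile_size_bytes)

# A 5 TB partition with a 1 TB part-file size yields 5 part files;
# doubling har.partfile.size to 2 TB yields 3 (fewer, larger files).
print(expected_archive_files(5 * 2**40, 1 * 2**40))
print(expected_archive_files(5 * 2**40, 2 * 2**40))
```

  As the paragraph notes, raising {{{har.partfile.size}}} trades fewer output files for fewer mappers, and therefore longer archiving time.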
  
@@ -44, +44 @@

  ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
  }}}
  
- Once the command is issued, a mapreduce job will be launched that performs 
the archiving. Note that there is no output on the CLI to indicate process.
+ Once the command is issued, a mapreduce job will be launched to perform the archiving. Unlike running Hive queries, there is no output on the CLI to indicate progress.
  
  === Unarchive ===
  
@@ -56, +56 @@

  
  == Cautions and Limitations ==
  
-  * In some older versions of Hadoop, HAR had a few bugs that could cause data 
loss / corruption. Be sure that these patches are integrated into your version 
of Hadoop:
+  * In some older versions of Hadoop, HAR had a few bugs that could cause data 
loss or other errors. Be sure that these patches are integrated into your 
version of Hadoop:
  
  [[https://issues.apache.org/jira/browse/MAPREDUCE-1548]]
  
@@ -72, +72 @@

  
  Hive comes with the HiveHarFileSystem class that addresses some of these issues, and it is the default value for {{{fs.har.impl}}}. Keep this in mind if you're rolling your own version of HarFileSystem. 
  
-  * The default HiveHarFileSystem.getFileBlockLocations() has '''no 
locality''. That means it may introduce higher network loads or reduced 
performance.
+  * The default HiveHarFileSystem.getFileBlockLocations() has '''no 
locality'''. That means it may introduce higher network loads or reduced 
performance.
  
   * Archived partitions cannot be overwritten with INSERT OVERWRITE ... The 
partition must be unarchived first.
   
-  * If two processes attempt to archive the same partition at the same time, 
bad things can happen. (Need to implement concurrency support..)
+  * If two processes attempt to archive the same partition at the same time, 
bad things could happen. (Need to implement concurrency support..)
  
  == Under the hood ==
  
- Internally, when a partition is archived, t
+ Internally, when a partition is archived, a HAR is created using the files 
from the partition's original location (e.g. {{{/warehouse/table/ds=1}}}). The 
parent directory of the partition is specified to be the same as the original 
location and the resulting archive is named 'data.har'. The archive is moved 
under the original directory (e.g. {{{/warehouse/table/ds=1/data.har}}}) and 
the partition's location is changed to point to the archive.
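+ Sketched as a before/after layout (paths are illustrative):
+ {{{
Before archiving:
  /warehouse/table/ds=1/file1
  /warehouse/table/ds=1/file2

After archiving:
  /warehouse/table/ds=1/data.har   <- HAR containing file1, file2
  (partition location now points at the archive)
+ }}}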
  
