[Hadoop Wiki] Update of "Hive/LanguageManual/Archiving" by PaulYang

Apache Wiki Mon, 01 Nov 2010 18:26:20 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/LanguageManual/Archiving" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving

--------------------------------------------------

New page:
= Archiving for File Count Reduction =

Note: Archiving should be considered an advanced command due to the caveats 
involved.

<<TableOfContents>>

== Overview ==

Due to the design of HDFS, the number of files in the filesystem directly 
affects the memory consumption of the namenode. While normally not a problem 
for small clusters, memory usage may hit the limits of accessible memory on a 
single machine when there are more than 50-100 million files. In such cases, it 
is advantageous to have as few files as possible.

The use of 
[[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop Archives]] 
is one approach to reducing the number of files in a partition. Hive has 
built-in support that allows users to easily move the files in an existing 
partition into a Hadoop Archive (HAR), so that a partition that once consisted 
of hundreds of files occupies only ~3 files (depending on settings). The 
trade-off is that queries may be slower due to the additional overhead of the 
indirection.

Note that archiving does NOT compress the files. HAR is analogous to the Unix 
tar command.

== Settings ==

Three settings should be configured before archiving is used (example values 
are shown):

{{{
hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;
}}}

{{{hive.archive.enabled}}} controls whether archiving operations are enabled.

{{{hive.archive.har.parentdir.settable}}} informs Hive whether the parent 
directory can be set when creating the archive. In the latest version of 
Hadoop, the {{{-p}}} option can be used to specify the root directory of the 
archive. For example, if {{{/dir1/dir2/file}}} is archived with {{{/dir1}}} as 
the parent directory, then the resulting archive file will contain the 
directory structure {{{dir2/file}}}. In older versions of Hadoop, this option 
is not available, so Hive must be configured to accommodate this limitation.

{{{har.partfile.size}}} controls the size of the files that make up the 
archive. The archive will contain {{{[Size of partition]/har.partfile.size}}} 
files, rounded up. Higher values mean fewer files, but result in longer 
archiving times due to the reduced number of mappers.
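
As a rough worked example of that calculation, assume a hypothetical partition 
holding 2.5 TB of data (a figure chosen purely for illustration) and the 
example setting shown above:

{{{
[Size of partition]  = 2748779069440 bytes (2.5 TB, assumed)
har.partfile.size    = 1099511627776 bytes (1 TB)
part files           = ceil(2748779069440 / 1099511627776) = ceil(2.5) = 3
}}}

Under the same setting, a partition smaller than 1 TB would be packed into a 
single part file.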

== Usage ==

=== Archive ===
Once the configuration values are set, a partition can be archived with the 
command:

{{{
ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, 
partition_col = partition_col_value, ...)
}}}

e.g. 
{{{
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
}}}

Once the command is issued, a MapReduce job will be launched that performs the 
archiving. Note that there is no output on the CLI to indicate progress.

=== Unarchive ===

The partition can be reverted to its original files with the unarchive 
command:

{{{
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')
}}}
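
Putting the steps together, a session might look like the following sketch. It 
reuses the {{{srcpart}}} example partition from above. The {{{SELECT}}} in the 
middle is only an illustration (not taken from the original page) that an 
archived partition can still be queried, possibly more slowly.

{{{
hive> set hive.archive.enabled=true;
hive> ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
hive> SELECT COUNT(1) FROM srcpart WHERE ds='2008-04-08' AND hr='12';
hive> ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');
}}}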

== Cautions and Limitations ==

 * In some older versions of Hadoop, HAR had a few bugs that could cause data 
loss or corruption. Be sure that these patches are integrated into your version 
of Hadoop:

[[https://issues.apache.org/jira/browse/MAPREDUCE-1548]]

[[https://issues.apache.org/jira/browse/HADOOP-6591]]

[[https://issues.apache.org/jira/browse/MAPREDUCE-2143]]

 * The HarFileSystem class still has a few bugs that have yet to be fixed:

[[https://issues.apache.org/jira/browse/MAPREDUCE-1752]]

[[https://issues.apache.org/jira/browse/MAPREDUCE-1877]]

Hive comes with the HiveHarFileSystem class, which addresses some of these 
issues and is the default value for {{{fs.har.impl}}}. Keep this in mind if 
you're rolling your own version of HarFileSystem. 

 * The default HiveHarFileSystem.getFileBlockLocations() has NO LOCALITY. That 
means it may introduce higher network loads or reduced performance.

 * Archived partitions cannot be overwritten with INSERT OVERWRITE ... The 
partition must be unarchived first.
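
The last limitation implies a three-step workflow when the contents of an 
archived partition need to be replaced. The following is a minimal sketch, 
assuming a hypothetical source table {{{src}}} with columns {{{key}}} and 
{{{value}}} that match the schema of {{{srcpart}}}:

{{{
-- 1. Restore the original files so the partition can be written to.
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12');

-- 2. Overwrite the partition (src, key and value are assumed names).
INSERT OVERWRITE TABLE srcpart PARTITION(ds='2008-04-08', hr='12')
SELECT key, value FROM src;

-- 3. Optionally archive the partition again to reduce the file count.
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12');
}}}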

== Under the hood ==
