The "Hive/LanguageManual/Archiving" page has been changed by PaulYang. http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving

= Archiving for File Count Reduction =

Note: Archiving should be considered an advanced command due to the caveats involved.

<<TableOfContents>>

== Overview ==

Due to the design of HDFS, the number of files in HDFS directly affects the memory consumption of the namenode. While this is normally not a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are more than 50-100 million files. In such cases, it is advantageous to have as few files as possible.

The use of [[http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html | Hadoop Archives]] is one approach to reducing the number of files in a partition. Hive has built-in support that allows users to easily move the files in an existing partition into a Hadoop Archive (HAR) file, so that a partition that may once have consisted of hundreds of files can occupy ~3 files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead of indirection. Note that archiving does NOT compress the files - HAR is analogous to the Unix tar command.

== Settings ==

There are 3 settings that should be configured before archiving is used. (Example values are shown.)

{{{
hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;
}}}

{{{hive.archive.enabled}}} controls whether archiving operations are enabled.

{{{hive.archive.har.parentdir.settable}}} informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop, the {{{-p}}} option can be used to specify the root directory of the archive.
For example, if {{{/dir1/dir2/file}}} were archived with {{{/dir1}}} as the parent directory, then the resulting archive file would contain the directory structure {{{dir2/file}}}. In older versions of Hadoop, this option was not available, so Hive must be configured to accommodate that limitation.

{{{har.partfile.size}}} controls the size of the files that make up the archive. The archive will contain {{{[Size of partition]/har.partfile.size}}} files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.

== Usage ==

=== Archive ===

Once the configuration values are set, a partition can be archived with the command:

{{{
ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
}}}

e.g.

{{{
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
}}}

Once the command is issued, a mapreduce job will be launched to perform the archiving. Note that there is no output on the CLI to indicate progress.

=== Unarchive ===

The partition can be reverted back to its original files with the unarchive command:

{{{
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')
}}}

== Cautions and Limitations ==

 * In some older versions of Hadoop, HAR had a few bugs that could cause data loss or corruption. Be sure that these patches are integrated into your version of Hadoop: [[https://issues.apache.org/jira/browse/MAPREDUCE-1548]] [[https://issues.apache.org/jira/browse/HADOOP-6591]] [[https://issues.apache.org/jira/browse/MAPREDUCE-2143]]
 * The HarFileSystem class still has a few bugs that have yet to be fixed: [[https://issues.apache.org/jira/browse/MAPREDUCE-1752]] [[https://issues.apache.org/jira/browse/MAPREDUCE-1877]] Hive comes with the HiveHarFileSystem class that addresses some of these issues, and it is the default value for {{{fs.har.impl}}}. Keep this in mind if you're rolling your own version of HarFileSystem.
 * The default HiveHarFileSystem.getFileBlockLocations() has NO LOCALITY. That means it may introduce higher network loads or reduced performance.
 * Archived partitions cannot be overwritten with INSERT OVERWRITE ... The partition must be unarchived first.

== Under the hood ==
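The {{{har.partfile.size}}} arithmetic from the Settings section can be sketched as follows. This is only an illustration of how the part-file count falls out of the setting; the function name is hypothetical and not part of Hive.

```python
import math

def estimated_har_part_files(partition_size_bytes, har_partfile_size):
    # Estimated number of part files in the resulting HAR archive:
    # the total partition size divided by har.partfile.size, rounded up
    # (at least one part file even for a tiny partition).
    return max(1, math.ceil(partition_size_bytes / har_partfile_size))

# With the example setting of 1 TiB (1099511627776 bytes), a 2.5 TiB
# partition would be packed into 3 part files.
print(estimated_har_part_files(int(2.5 * 2**40), 1099511627776))  # prints 3
```

Since each part file is produced by roughly one mapper, a larger {{{har.partfile.size}}} trades archiving parallelism for a smaller file count, which is the trade-off noted above.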
