[ https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866008#action_12866008 ]
Gaurav Jain commented on PIG-1411:
----------------------------------

In general, HAR is a good idea for a use case with lots of small files.

In the namenode:
-- Each block takes 200 bytes
-- There are 3 replicas, so 600 bytes
-- 200 bytes for the inode of 1 block
-- 800 bytes - 1K bytes for 1 file with 1 block

Let's say:
-- There are 128 files of 1M size each
-- ~128K bytes are taken in the namenode

With HAR:
-- HDFS block size of 128M
-- All 128 1M files will be written into 1 block of a HAR part file
-- ~1K taken in the namenode

As seen, the amount of namenode memory consumed goes down considerably. So, in this use case, if the fixed performance overhead is acceptable to the application, HAR is a good choice for LONG RUNNING jobs. However, for files >= 128M, HAR does not give significant memory savings. Explained below.

> [Zebra] Can Zebra use HAR to reduce file/block count for namenode
> -----------------------------------------------------------------
>
>                 Key: PIG-1411
>                 URL: https://issues.apache.org/jira/browse/PIG-1411
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Gaurav Jain
>            Assignee: Gaurav Jain
>            Priority: Minor
>             Fix For: 0.8.0
>
>
> Due to its column group structure, Zebra can create extra files for the namenode to remember. That means the namenode takes more memory for Zebra-related files. The goal is to reduce the number of files/blocks.
> The idea, among various options, is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an HDFS block of larger size, thus reducing the total number of blocks and files.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
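The namenode-memory arithmetic in the comment above can be sketched as a quick back-of-the-envelope calculation. This is only an estimate using the per-block and per-inode figures quoted in the comment (200 bytes each, assumed), and it ignores HAR's own small index files (_index, _masterindex):

```python
# Back-of-the-envelope namenode memory estimate, per the figures above.
BLOCK_META_BYTES = 200   # bytes per block replica entry (figure from the comment)
REPLICAS = 3             # default HDFS replication factor
INODE_BYTES = 200        # bytes per inode for one file (figure from the comment)

def namenode_bytes(num_files, blocks_per_file=1):
    """Rough namenode memory for num_files files of blocks_per_file blocks each."""
    per_file = blocks_per_file * BLOCK_META_BYTES * REPLICAS + INODE_BYTES
    return num_files * per_file

# 128 small files of 1M each, 1 block per file:
without_har = namenode_bytes(128)   # ~100K bytes in the namenode
# With HAR: all 128M of data fits in 1 block of a single HAR part file:
with_har = namenode_bytes(1)        # ~1K bytes (ignoring HAR index files)
print(without_har, with_har)
```

An archive itself would be built with the standard `hadoop archive -archiveName <name>.har -p <parent> <src> <dest>` command and read back through the `har://` filesystem scheme; for files already at or above the 128M block size, the per-file block count barely changes, which is why the savings disappear there.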