[ https://issues.apache.org/jira/browse/MAPREDUCE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833542#action_12833542 ]

Rodrigo Schmidt commented on MAPREDUCE-1491:
--------------------------------------------

Dhruba, thanks for reviewing the code.

As for your question: with the current code, .har files are never deleted 
automatically. In the scenario you presented, when you delete one of the files, 
the har file is left as is, with all 10 parity files inside. I did that 
precisely to avoid leaving the other files with less redundancy.

Also, if you recreate one of the files, a new parity file is generated 
outside the har, but the code in the RaidNode is smart enough to pick up the 
parity file outside the har.
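The lookup order described above (a fresh standalone parity file wins over the stale copy still inside the archive) can be sketched as follows. This is a minimal illustration, not the actual RaidNode API; the class, method, and path names are assumptions.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the parity lookup order: prefer a parity file
// generated outside the har, and fall back to the copy inside the archive.
public class ParityLookup {
    public static Optional<String> locateParity(String srcFile,
                                                Set<String> standaloneParity,
                                                Set<String> harParity) {
        String name = srcFile + ".parity";
        // A standalone parity file is newer than any har copy, so it wins.
        if (standaloneParity.contains(name)) {
            return Optional.of("/raid/" + name);
        }
        // Otherwise use the (possibly stale) copy packed inside the har.
        if (harParity.contains(name)) {
            return Optional.of("/raid/archive.har/" + name);
        }
        return Optional.empty();
    }
}
```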

The downside of the current patch is that even if all the files are deleted or 
recreated, the har file is never deleted and new parity files are created 
outside it. In the future I plan to fix that and enable har files to be 
recreated when they become obsolete. I didn't do it now, to keep the code 
simple enough to be reviewed and deployed quickly.

More importantly, the main idea behind using har in raid is to apply it to 
files that probably won't change in the future (otherwise recreating the 
archives becomes too expensive). The code uses a raid property called 
time_before_har (on each policy) to decide when files are old enough to be 
har'ed. Setting this property appropriately will avoid wasting space in most 
practical cases.
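For concreteness, a per-policy entry carrying time_before_har might look like the sketch below. The surrounding element names follow the general style of the contrib/raid policy configuration and are assumptions, as is the example value; only the time_before_har property name comes from the discussion above.

```xml
<configuration>
  <srcPath prefix="/user/warehouse">
    <policy name="archive-old-parity">
      <property>
        <name>time_before_har</name>
        <!-- how long a file must be unmodified before its parity is
             eligible to be packed into a har archive (value assumed) -->
        <value>604800</value>
      </property>
    </policy>
  </srcPath>
</configuration>
```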

Let me know what you think of this.


> Use HAR filesystem to merge parity files 
> -----------------------------------------
>
>                 Key: MAPREDUCE-1491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1491
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/raid
>            Reporter: Rodrigo Schmidt
>            Assignee: Rodrigo Schmidt
>         Attachments: MAPREDUCE-1491.0.patch
>
>
> The HDFS raid implementation (HDFS-503) creates a parity file for every file 
> that is RAIDed. This puts additional burden on the memory requirements of the 
> namenode. It would be nice if the parity files were combined together using 
> the HadoopArchive (har) format.
> This was HDFS-684 before, but raid migrated to MAPREDUCE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.