[ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482534 ]

Sameer Paranjpye commented on HADOOP-1134:
------------------------------------------

+1 for offline upgrades.

>> Owen O'Malley [20/Mar/07 12:59 PM] I think the inline CRCs are too 
>> problematic. They will add a mapping between logical and physical 
>> offsets into the block that will hit a fair amount of code. If the 
>> side file is opened with a 4k buffer, it will only take 2 reads of the 
>> side file to handle the entire block (assuming a 4B CRC per 64KB and 
>> 128MB blocks). It also is much, much easier to handle upgrade.
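For reference, the 2-read figure checks out. A minimal sketch of the 
arithmetic, with the constants taken from the quote above (the class 
name is mine):

// Side-file math: 4-byte CRC per 64KB chunk, 128MB block, 4KB buffer.
public class CrcSideFileMath {
    public static void main(String[] args) {
        long blockSize  = 128L * 1024 * 1024;  // 128MB block
        long chunkSize  = 64L * 1024;          // data bytes covered per CRC
        long crcSize    = 4;                   // CRC32 is 4 bytes
        long bufferSize = 4L * 1024;           // 4KB side-file read buffer

        long chunks   = blockSize / chunkSize;                    // 2048 chunks
        long sideFile = chunks * crcSize;                         // 8192 bytes of CRCs
        long reads    = (sideFile + bufferSize - 1) / bufferSize; // 2 reads

        System.out.println(sideFile + " bytes of CRCs, " + reads + " reads");
    }
}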

It takes only 2 reads to handle the entire block, which is good. But it 
takes those same 2 reads to serve even a tiny fraction of the block, 
which is where the downside appears. It's quite clear that doing inline 
checksums makes the upgrade process a lot harder. The question is 
whether taking the hit of a difficult upgrade and complicating the data 
access code is a reasonable price to pay for permanently halving the 
number of seeks in the system. It feels like it is. Thoughts?
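
To make the "mapping between logical and physical offsets" concrete, 
here is a hypothetical sketch of what the read path would have to do 
with inline checksums. The layout assumed here (each 64KB data chunk 
immediately followed by its 4-byte CRC) is for illustration only, not a 
settled design:

// Hypothetical inline-CRC layout: each 64KB data chunk is immediately
// followed by its 4-byte CRC. Illustration only.
public class InlineCrcOffsets {
    static final long CHUNK = 64L * 1024;  // data bytes per CRC
    static final long CRC   = 4;           // CRC32 width in bytes

    // Physical offset in the on-disk block for a logical (user) offset.
    static long toPhysical(long logical) {
        return logical + (logical / CHUNK) * CRC;
    }

    // Inverse, valid when 'physical' points at data bytes, not CRC bytes.
    static long toLogical(long physical) {
        return physical - (physical / (CHUNK + CRC)) * CRC;
    }

    public static void main(String[] args) {
        // e.g. logical offset 128KB lands after two inline CRCs:
        System.out.println(toPhysical(128L * 1024));  // 131080
        System.out.println(toLogical(131080L));       // 131072
    }
}

Every seek and positioned read would have to route through something 
like this, which is the code-complexity cost being weighed above.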

> Block level CRCs in HDFS
> ------------------------
>
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to 
> core HDFS. See the recent improvement HADOOP-928 ( which can add checksums 
> to a given filesystem ) for more about it. Though this has served us well, 
> there are a few disadvantages:
> 1) This doubles the namespace in HDFS ( or other filesystem 
> implementations ). In many cases, it nearly doubles the number of blocks. 
> Taking the namenode out of CRCs would nearly double namespace performance, 
> both in terms of CPU and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted 
> blocks. With block level CRCs, the Datanode can periodically verify the 
> checksums and report corruptions to the namenode so that new replicas can 
> be created.
> We propose to have CRCs maintained for all HDFS data in much the same way 
> as in GFS. I will update the jira with detailed requirements and design. 
> This will include the same guarantees provided by the current 
> implementation and will include an upgrade of current data.
>  
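
For illustration, a rough sketch of the periodic verification described 
in point 2 of the description above: recompute a CRC32 per 64KB chunk of 
the block file and compare it against the stored checksums. The file 
layout, names, and reporting step here are assumptions, not the actual 
design:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical sketch: verify a block file against a side file of
// CRC32s, one 4-byte CRC per 64KB chunk. Not the actual HDFS layout.
public class BlockVerifier {
    static final int CHUNK_SIZE = 64 * 1024;

    static boolean verify(String blockFile, String crcFile) throws IOException {
        try (FileInputStream data = new FileInputStream(blockFile);
             DataInputStream crcs =
                 new DataInputStream(new FileInputStream(crcFile))) {
            byte[] buf = new byte[CHUNK_SIZE];
            while (true) {
                int n = 0, r;
                while (n < buf.length
                       && (r = data.read(buf, n, buf.length - n)) > 0) {
                    n += r;
                }
                if (n == 0) {
                    return true;        // clean end of block file
                }
                CRC32 crc = new CRC32();
                crc.update(buf, 0, n);  // last chunk may be short
                if ((int) crc.getValue() != crcs.readInt()) {
                    return false;       // corrupt: report to the namenode
                }
            }
        }
    }
}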

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
