[ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484548 ]
Raghu Angadi commented on HADOOP-1134:
--------------------------------------

> It's worth noting that, if a .crc file cannot be found for a block, then the
> upgrade should probably generate one (along with a warning).
> So that, after the upgrade, all blocks should be guaranteed to have CRC
> files. We still might fail softly when a CRC file is missing,
> logging a warning and either generating CRCs on the fly or regenerating the
> CRC file.

I agree. We will generate a local CRC file, with a warning, if we cannot find the ".crc" file data. I think the datanode should fail softly with a warning if a CRC file is missing. I don't think it is necessary to regenerate the CRC file if one is missing outside upgrade mode; that should be a normal error condition.

Thanks for the feedback. I am preparing an HTML file that will include all the decisions made so far, as well as later comments on this jira. I will attach the HTML file here.

> Block level CRCs in HDFS
> ------------------------
>
> Key: HADOOP-1134
> URL: https://issues.apache.org/jira/browse/HADOOP-1134
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Reporter: Raghu Angadi
> Assigned To: Raghu Angadi
>
> Currently CRCs are handled at FileSystem level and are transparent to core
> HDFS. See recent improvement HADOOP-928 (which can add checksums to a given
> filesystem) for more about it. Though this has served us well, there are a few
> disadvantages:
> 1) This doubles namespace in HDFS (or other filesystem implementations). In
> many cases, it nearly doubles the number of blocks. Taking namenode out of
> CRCs would nearly double namespace performance both in terms of CPU and
> memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block level CRCs, the Datanode can periodically verify the checksums
> and report corruptions to the namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as
> in GFS.
> I will update the jira with detailed requirements and design. This
> will include the same guarantees provided by the current implementation and will
> include an upgrade of current data.
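
As a rough illustration of the missing-CRC policy discussed in the comment above (generate the CRC file with a warning during upgrade, fail softly with a warning otherwise), here is a minimal Python sketch. It is not datanode code. The names `compute_crc`, `crc_path`, `load_crc` and the `upgrading` flag are hypothetical, and the sketch assumes a single CRC32 per block stored as 4 big-endian bytes next to the block file, a layout this issue does not specify.

```python
import logging
import os
import zlib
from typing import Optional

log = logging.getLogger("block_crc_sketch")


def compute_crc(block_path: str, chunk_size: int = 64 * 1024) -> int:
    """Return the CRC32 of a block file, reading it in chunks."""
    crc = 0
    with open(block_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def crc_path(block_path: str) -> str:
    """Hypothetical naming scheme: the CRC file sits beside the block."""
    return block_path + ".crc"


def load_crc(block_path: str, upgrading: bool) -> Optional[int]:
    """Return the stored CRC for a block, or None if it is unavailable.

    A missing CRC file always produces a warning. It is regenerated
    only when `upgrading` is True; otherwise the caller gets None and
    can treat the block as unverified (the "fail softly" case).
    """
    path = crc_path(block_path)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return int.from_bytes(f.read(4), "big")

    log.warning("CRC file missing for block %s", block_path)
    if not upgrading:
        return None  # normal error condition outside upgrade mode

    # Upgrade mode: generate the CRC file so every block has one afterwards.
    crc = compute_crc(block_path)
    with open(path, "wb") as f:
        f.write(crc.to_bytes(4, "big"))
    return crc
```

Calling `load_crc(path, upgrading=True)` once per block during an upgrade pass would leave every block with a CRC file, while normal reads would call it with `upgrading=False` and only log the problem.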