[ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484970 ]
Raghu Angadi commented on HADOOP-1134:
--------------------------------------

"io.bytes.per.checksum" will still be used for the expected purpose and will dictate bytes/checksum for all new data created by the client. Doug preferred the client dictating this value; I preferred the Namenode or Datanode informing the client (using the same config).

Similar to the current behavior, each checksum file has a header that indicates bytes/checksum. Thus at any time each block has its own bytes/checksum; it does not need to match that of other blocks or even of other replicas. When a block is copied to another datanode, the source datanode decides what the bytes/checksum value is. (A rough sketch of such a header appears at the end of this comment.)

> Do we simply copy existing checksum data or do we re-generate it?

During upgrade, we simply copy, of course with the new header. This is Doug's preference, since it will speed up the upgrade. I agree, but don't mind implementing a forced check during upgrade. During block replication, the destination datanode verifies the checksum and creates its local copy. The end result is that source and destination have the same content in the checksum file (unless the header format has changed).

> I don't think simply copying checksum data is enough since the checksums can
> themselves be corrupt. We need some level of validation. We can compare
> copies of the checksum data against each other; if we find a majority of
> copies that match then we treat those as authoritative. But what happens
> when we don't find a majority? Or we can re-generate checksum data on the
> Datanode and validate it against the existing data.

If we cannot get the old CRC data for any reason, we will generate it based on the local data (which could be wrong). There are three options to validate upgraded data (for simplicity, not all details and error conditions are spelled out):

1) use the old CRCs (Doug's choice)
2) check the CRC of each replica and choose the majority (Sameer's choice)
3) a combination of (1) and (2), i.e. use (2) if (1) fails, etc. This option is proposed only now.

At this point, I would leave it to you to decide which one we should do. Please choose one. (Option (3) is sketched below.)

> How does a Datanode discover authoritative sources of checksum data for its
> blocks? Is this during upgrade?
> This works while the upgrade is in progress but perhaps it can be extended to
> deal with Datanodes that join the system after the upgrade is complete. If a
> Datanode joins after a complete upgrade and crc file deletion, the Namenode
> could redirect it to other Datanodes that have copies of its blocks; the new
> Datanode can then pull block level CRC files from its peers, validate its
> data and perform an upgrade even though the .crc files are gone.

My expectation was that once the upgrade is considered done by the namenode, any datanode that comes up with an old database will shut down with a clear error message, without entering the upgrade phase. The following two conditions should be met before the upgrade is considered done (also sketched below):

1) All the datanodes that registered should report completion of their upgrade. If the Namenode restarts, each datanode will re-register and again report its completion.
2) Similar to the current safemode, "dfs.safemode.threshold.pct" of the blocks that belong to non-.crc files should have at least "dfs.replication.min" replicas reported as upgraded.

Of course, we can still do what you propose for datanodes that come up with old data after the above conditions are met, if it is considered required. It means that some of the upgrade-specific code would spread a little more into the normal operation of the namenode. The upgrade is already complicated enough that this might not add much more code.
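For illustration, a minimal sketch of what such a per-block checksum file header could look like. The class name, field layout, and version field are assumptions for this sketch; this comment does not pin down the actual on-disk format.

{code:java}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical per-block checksum file header. It only shows how each
// block can carry its own bytes/checksum value, so blocks (and even
// replicas of the same block) need not agree with each other.
class BlockCrcHeader {
  int version;           // header/format version
  int bytesPerChecksum;  // e.g. the writer's io.bytes.per.checksum

  void write(DataOutputStream out) throws IOException {
    out.writeInt(version);
    out.writeInt(bytesPerChecksum);
  }

  static BlockCrcHeader read(DataInputStream in) throws IOException {
    BlockCrcHeader h = new BlockCrcHeader();
    h.version = in.readInt();
    h.bytesPerChecksum = in.readInt();
    return h;
  }
}
{code}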
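To make option (3) concrete, a minimal sketch of the decision logic, assuming hypothetical inputs: the checksum recovered from the old .crc file (null if unreadable) and the checksums computed independently from each replica. None of these class or method names come from an actual patch.

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of upgrade-time validation, option (3): use the old CRC data
// when it is available, else fall back to a majority vote across
// per-replica checksums. Returns null when no authoritative checksum
// can be chosen, in which case the caller has to regenerate from local
// data (which could silently bless a corrupt block).
class UpgradeCrcValidator {
  byte[] chooseChecksum(byte[] oldCrc, List<byte[]> replicaCrcs) {
    if (oldCrc != null) {
      return oldCrc;                  // option (1): trust old .crc data
    }
    Map<String, Integer> votes = new HashMap<String, Integer>();
    byte[] best = null;
    int bestCount = 0;
    for (byte[] crc : replicaCrcs) {  // option (2): majority vote
      int count = votes.merge(Arrays.toString(crc), 1, Integer::sum);
      if (count > bestCount) {
        bestCount = count;
        best = crc;
      }
    }
    // Require a strict majority; otherwise there is no authoritative copy.
    return (bestCount > replicaCrcs.size() / 2) ? best : null;
  }
}
{code}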
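And a minimal sketch of the two upgrade-completion conditions above. The config keys are the ones named in this comment; the class, the counters (assumed to be maintained elsewhere on the namenode), and the default threshold value are illustrative assumptions.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch of the proposed "upgrade is done" decision on the namenode.
class UpgradeCompletionCheck {
  int registeredDatanodes;
  int datanodesReportedDone;      // re-reported after a namenode restart
  long totalNonCrcBlocks;         // blocks that do not belong to .crc files
  long blocksWithMinUpgradedReplicas;

  boolean upgradeDone(Configuration conf) {
    // Condition 1: every datanode that registered has reported completion.
    if (datanodesReportedDone < registeredDatanodes) {
      return false;
    }
    // Condition 2: as with safemode, a threshold fraction of non-.crc
    // blocks must have at least dfs.replication.min upgraded replicas.
    float threshold = conf.getFloat("dfs.safemode.threshold.pct", 0.999f);
    return blocksWithMinUpgradedReplicas >= threshold * totalNonCrcBlocks;
  }
}
{code}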
> Block level CRCs in HDFS
> ------------------------
>
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
>
> Currently CRCs are handled at the FileSystem level and are transparent to
> core HDFS. See the recent improvement HADOOP-928 (which can add checksums to
> a given filesystem) for more about it. Though this has served us well, there
> are a few disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations).
> In many cases, it nearly doubles the number of blocks. Taking the namenode
> out of CRCs would nearly double namespace performance, both in terms of CPU
> and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted
> blocks. With block level CRCs, the Datanode can periodically verify the
> checksums and report corruption to the namenode so that new replicas can be
> created.
> We propose to have CRCs maintained for all HDFS data in much the same way as
> in GFS. I will update the jira with detailed requirements and design. This
> will include the same guarantees provided by the current implementation and
> will include an upgrade of the current data.