[ https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390170#comment-16390170 ]

Ajay Kumar commented on HDFS-13056:
-----------------------------------

Hi [~dennishuo], the below test case fails with the latest patch:
# An HDFS client creates a new file with the default "dfs.bytes-per-checksum" and 
"dfs.checksum.combine.mode=COMPOSITE_CRC".
# The client appends to the same file with a different value of "dfs.bytes-per-checksum" 
and the {{CreateFlag.NEW_BLOCK}} flag.
# Fetch the checksum for this file.
Checksum retrieval fails with the below exception:
{code}
java.io.IOException: Byte-per-checksum not matched: bpc=256 but bytesPerCRC=128
        at org.apache.hadoop.hdfs.FileChecksumHelper$ReplicatedFileChecksumComputer.tryDatanode(FileChecksumHelper.java:490)
        at org.apache.hadoop.hdfs.FileChecksumHelper$ReplicatedFileChecksumComputer.checksumBlock(FileChecksumHelper.java:421)
        at org.apache.hadoop.hdfs.FileChecksumHelper$ReplicatedFileChecksumComputer.checksumBlocks(FileChecksumHelper.java:394)
        at org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:251)
        at org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1778)
        at org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1797)
        at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1683)
        at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1680)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1692)
        at org.apache.hadoop.fs.shell.Display$Checksum.processPath(Display.java:193)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
        at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:303)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:285)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:269)
        at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
checksum: Fail to get block MD5 for LocatedBlock{BP-589677616-192.168.1.149-1520443372952:blk_1073743641_5922; getBlockSize()=89; corrupt=false; offset=275048; locs=[DatanodeInfoWithStorage[127.0.0.1:9866,D
{code}
This seems to be an expected scenario in the patch, but it breaks the use case where a 
client appends to an existing HDFS file with a different value of {{dfs.bytes-per-checksum}}.
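For reference, a rough sketch of the failing sequence from the steps above as a standalone program (illustrative only; the {{MiniDFSCluster}} setup, buffer sizes, and file names here are placeholders, not the exact test I ran):
{code}
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class AppendChecksumRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.checksum.combine.mode", "COMPOSITE_CRC");
    // Leave dfs.bytes-per-checksum at its default (512) for the initial create.
    MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    try {
      DistributedFileSystem fs = cluster.getFileSystem();
      Path file = new Path("/repro.dat");

      // 1. Create a new file with the default bytes-per-checksum.
      try (FSDataOutputStream out = fs.create(file)) {
        out.write(new byte[1024]);
      }

      // 2. Append from a client configured with a different bytes-per-checksum,
      //    forcing the appended data into a new block via CreateFlag.NEW_BLOCK.
      Configuration conf2 = new Configuration(conf);
      conf2.setInt("dfs.bytes-per-checksum", 256);
      DistributedFileSystem fs2 =
          (DistributedFileSystem) FileSystem.newInstance(fs.getUri(), conf2);
      try (FSDataOutputStream out = fs2.append(file,
          EnumSet.of(CreateFlag.APPEND, CreateFlag.NEW_BLOCK), 4096, null)) {
        out.write(new byte[1024]);
      }

      // 3. Fetch the file-level checksum; this is where the
      //    "Byte-per-checksum not matched" IOException above surfaces.
      FileChecksum checksum = fs.getFileChecksum(file);
      System.out.println(checksum);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}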

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13056
>                 URL: https://issues.apache.org/jira/browse/HDFS-13056
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, distcp, erasure-coding, federation, hdfs
>    Affects Versions: 3.0.0
>            Reporter: Dennis Huo
>            Assignee: Dennis Huo
>            Priority: Major
>         Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.004.patch, HDFS-13056-branch-2.8.poc1.patch, 
> HDFS-13056.001.patch, HDFS-13056.002.patch, HDFS-13056.003.patch, 
> HDFS-13056.003.patch, HDFS-13056.004.patch, HDFS-13056.005.patch, 
> HDFS-13056.006.patch, HDFS-13056.007.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf
>
>
> FileChecksum was first introduced in 
> [https://issues-test.apache.org/jira/browse/HADOOP-3981] and ever since then 
> has remained defined as MD5-of-MD5-of-CRC, where per-512-byte chunk CRCs are 
> already stored as part of datanode metadata, and the MD5 approach is used to 
> compute an aggregate value in a distributed manner, with individual datanodes 
> computing the MD5-of-CRCs per-block in parallel, and the HDFS client 
> computing the second-level MD5.
>  
> A shortcoming of this approach which is often brought up is the fact that 
> this FileChecksum is sensitive to the internal block-size and chunk-size 
> configuration, and thus different HDFS files with different block/chunk 
> settings cannot be compared. More commonly, one might have different HDFS 
> clusters which use different block sizes, in which case any data migration 
> won't be able to use the FileChecksum for distcp's rsync functionality or for 
> verifying end-to-end data integrity (on top of low-level data integrity 
> checks applied at data transfer time).
>  
> This was also revisited in https://issues.apache.org/jira/browse/HDFS-8430 
> during the addition of checksum support for striped erasure-coded files; 
> while there was some discussion of using CRC composability, it still 
> ultimately settled on a hierarchical MD5 approach, which also adds the problem 
> that checksums of basic replicated files are not comparable to striped files.
>  
> This feature proposes to add a "COMPOSITE-CRC" FileChecksum type which uses 
> CRC composition to remain completely chunk/block agnostic, and allows 
> comparison between striped vs replicated files, between different HDFS 
> instances, and possibly even between HDFS and other external storage systems. 
> This feature can also be added in-place to be compatible with existing block 
> metadata, and doesn't need to change the normal path of chunk verification, 
> so is minimally invasive. This also means even large preexisting HDFS 
> deployments could adopt this feature to retroactively sync data. A detailed 
> design document can be found here: 
> https://storage.googleapis.com/dennishuo/hdfs-file-composite-crc32-v1.pdf
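
For readers new to the underlying trick: CRC composition means that CRC(A || B) can be derived from CRC(A), CRC(B), and the length of B alone, so per-chunk and per-block CRCs can be folded into one file-level CRC regardless of chunk/block boundaries. Below is a minimal standalone sketch of the standard zlib-style combine operation (illustrative only; this is not the code from the patch):
{code}
import java.util.zip.CRC32;

/** Demo: derive CRC32(A || B) from CRC32(A), CRC32(B) and len(B) only. */
public class CrcComposeDemo {

  // Multiply a 32x32 GF(2) matrix (one long per column vector) by a vector.
  private static long gf2MatrixTimes(long[] mat, long vec) {
    long sum = 0;
    int i = 0;
    while (vec != 0) {
      if ((vec & 1) != 0) {
        sum ^= mat[i];
      }
      vec >>>= 1;
      i++;
    }
    return sum;
  }

  // square = mat * mat over GF(2).
  private static void gf2MatrixSquare(long[] square, long[] mat) {
    for (int n = 0; n < 32; n++) {
      square[n] = gf2MatrixTimes(mat, mat[n]);
    }
  }

  /** Combine crcA (over A) and crcB (over B) into the CRC of A || B. */
  static long crc32Combine(long crcA, long crcB, long lenB) {
    if (lenB <= 0) {
      return crcA;
    }
    long[] odd = new long[32];
    long[] even = new long[32];

    // Operator for one zero bit: the reflected CRC-32 polynomial plus shifts.
    odd[0] = 0xedb88320L;
    long row = 1;
    for (int n = 1; n < 32; n++) {
      odd[n] = row;
      row <<= 1;
    }
    gf2MatrixSquare(even, odd);  // operator for two zero bits
    gf2MatrixSquare(odd, even);  // operator for four zero bits

    // Advance crcA over lenB zero bytes, consuming lenB one bit at a time.
    do {
      gf2MatrixSquare(even, odd);
      if ((lenB & 1) != 0) {
        crcA = gf2MatrixTimes(even, crcA);
      }
      lenB >>= 1;
      if (lenB == 0) {
        break;
      }
      gf2MatrixSquare(odd, even);
      if ((lenB & 1) != 0) {
        crcA = gf2MatrixTimes(odd, crcA);
      }
      lenB >>= 1;
    } while (lenB != 0);

    return crcA ^ crcB;
  }

  public static void main(String[] args) {
    byte[] a = "hello ".getBytes();
    byte[] b = "composite crc world".getBytes();

    CRC32 crcA = new CRC32();
    crcA.update(a);
    CRC32 crcB = new CRC32();
    crcB.update(b);
    CRC32 crcAB = new CRC32();
    crcAB.update(a);
    crcAB.update(b);

    long combined = crc32Combine(crcA.getValue(), crcB.getValue(), b.length);
    // Both lines should print the same value.
    System.out.println(Long.toHexString(combined));
    System.out.println(Long.toHexString(crcAB.getValue()));
  }
}
{code}
The same idea extends to CRC32C and to striped layouts: as long as each contributor reports its CRC and byte length, the composed result is independent of how the bytes were split into chunks, blocks, or stripes.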


