[ https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347851#comment-16347851 ]

Dennis Huo commented on HDFS-13056:
-----------------------------------

Uploaded an initial end-to-end working draft against trunk. It supports CRC32 
and CRC32C, handles partial-file prefixes, works with arbitrary bytes-per-crc 
and blocksize settings, and covers both replicated and striped layouts.

The striped-reconstruction path is still a TODO, and adding stripe support made 
things messy enough that some refactoring is in order. Unit tests are also 
still pending, but manual testing in a real setup works (a sketch of the 
underlying CRC-composition math follows the transcript below):

{code:bash}
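# Replicated layout: copy a test file with default settings, custom
# bytes-per-checksum (1024), custom blocksize (64MB), an unaligned-length
# variant, and the gzip CRC32 checksum type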
$ hdfs dfs -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmp/random-crctest-default1.dat
$ hdfs dfs -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmp/random-crctest-default2.dat
$ hdfs dfs -Ddfs.bytes-per-checksum=1024 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmp/random-crctest-bpc1024.dat
$ hdfs dfs -Ddfs.blocksize=67108864 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmp/random-crctest-blocksize64mb.dat
$ hdfs dfs -cp gs://hadoop-cloud-dev-dhuo/random-crctest-unaligned.dat hdfs:///tmp/random-crctest-unaligned1.dat
$ hdfs dfs -Ddfs.bytes-per-checksum=1024 -cp gs://hadoop-cloud-dev-dhuo/random-crctest-unaligned.dat hdfs:///tmp/random-crctest-unaligned2.dat
$ hdfs dfs -Ddfs.checksum.type=CRC32 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmp/random-crctest-gzipcrc32-1.dat
$ hdfs dfs -Ddfs.checksum.type=CRC32 -Ddfs.bytes-per-checksum=1024 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmp/random-crctest-gzipcrc32-2.dat


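# Striped layout: enable an XOR-2-1-1024k erasure-coding policy on a test
# directory, then repeat the same copies there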
$ hdfs dfs -mkdir hdfs:///tmpec
$ hdfs ec -enablePolicy -policy XOR-2-1-1024k
$ hdfs ec -setPolicy -path hdfs:///tmpec -policy XOR-2-1-1024k


$ hdfs dfs -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmpec/random-crctest-default1.dat
$ hdfs dfs -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmpec/random-crctest-default2.dat
$ hdfs dfs -Ddfs.bytes-per-checksum=1024 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmpec/random-crctest-bpc1024.dat
$ hdfs dfs -Ddfs.blocksize=67108864 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmpec/random-crctest-blocksize64mb.dat
$ hdfs dfs -cp gs://hadoop-cloud-dev-dhuo/random-crctest-unaligned.dat hdfs:///tmpec/random-crctest-unaligned1.dat
$ hdfs dfs -Ddfs.bytes-per-checksum=1024 -cp gs://hadoop-cloud-dev-dhuo/random-crctest-unaligned.dat hdfs:///tmpec/random-crctest-unaligned2.dat
$ hdfs dfs -Ddfs.checksum.type=CRC32 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmpec/random-crctest-gzipcrc32-1.dat
$ hdfs dfs -Ddfs.checksum.type=CRC32 -Ddfs.bytes-per-checksum=1024 -cp gs://hadoop-cloud-dev-dhuo/random-crctest.dat hdfs:///tmpec/random-crctest-gzipcrc32-2.dat

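# Default MD5-of-MD5-of-CRC FileChecksums: values change with
# bytes-per-checksum and blocksize, and replicated vs striped copies of
# identical bytes disagree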
$ hdfs dfs -checksum hdfs:///tmp/random-crctest*.dat
hdfs:///tmp/random-crctest-blocksize64mb.dat    MD5-of-131072MD5-of-512CRC32C   0000020000000000000200008baa940ef6ed21fb4bd6224ce917d127
hdfs:///tmp/random-crctest-bpc1024.dat  MD5-of-131072MD5-of-1024CRC32C  000004000000000000020000930b0d7ad333786a839b044ed8d18d2d
hdfs:///tmp/random-crctest-default1.dat MD5-of-262144MD5-of-512CRC32C   000002000000000000040000c0baeeacbc4b5a3c8af5152944fe2d79
hdfs:///tmp/random-crctest-default2.dat MD5-of-262144MD5-of-512CRC32C   000002000000000000040000c0baeeacbc4b5a3c8af5152944fe2d79
hdfs:///tmp/random-crctest-gzipcrc32-1.dat      MD5-of-262144MD5-of-512CRC32    00000200000000000004000049d52fdd25aa08559e20536acc34d51d
hdfs:///tmp/random-crctest-gzipcrc32-2.dat      MD5-of-131072MD5-of-1024CRC32   0000040000000000000200001d5468ea4093ddb3741790b8dc3b9a57
hdfs:///tmp/random-crctest-unaligned1.dat       MD5-of-262144MD5-of-512CRC32C   0000020000000000000400000da665dadca0df00456206f234d5f8b0
hdfs:///tmp/random-crctest-unaligned2.dat       MD5-of-131072MD5-of-1024CRC32C  00000400000000000002000027c2198f48224a0ddb92c4dc4addd28b

$ hdfs dfs -checksum hdfs:///tmpec/random-crctest*.dat
18/02/01 01:15:54 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.2-hadoop2
hdfs:///tmpec/random-crctest-blocksize64mb.dat  MD5-of-131072MD5-of-512CRC32C   0000020000000000000200005b54faaa368ed81b25984a746c767d39
hdfs:///tmpec/random-crctest-bpc1024.dat        MD5-of-131072MD5-of-1024CRC32C  00000400000000000002000089a128b1e1995256bdb34fb95720dafc
hdfs:///tmpec/random-crctest-default1.dat       MD5-of-262144MD5-of-512CRC32C   00000200000000000004000007ee18e8f4909647adf085ec0f464d1a
hdfs:///tmpec/random-crctest-default2.dat       MD5-of-262144MD5-of-512CRC32C   00000200000000000004000007ee18e8f4909647adf085ec0f464d1a
hdfs:///tmpec/random-crctest-gzipcrc32-1.dat    MD5-of-262144MD5-of-512CRC32    000002000000000000040000d79ad1fa00fad2f0adb18f49f2e90bb3
hdfs:///tmpec/random-crctest-gzipcrc32-2.dat    MD5-of-131072MD5-of-1024CRC32   000004000000000000020000126ac7bc467c59942734bd8ebf690440
hdfs:///tmpec/random-crctest-unaligned1.dat     MD5-of-262144MD5-of-512CRC32C   0000020000000000000400004b95df26144cba3d1a0ab87cea048b66
hdfs:///tmpec/random-crctest-unaligned2.dat     MD5-of-131072MD5-of-1024CRC32C  000004000000000000020000c8b50f1216f55608975624f6a34542bc

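# COMPOSITE_CRC mode: every copy of the same bytes yields the same checksum
# regardless of layout, blocksize, or bytes-per-checksum; only the CRC
# polynomial (CRC32 vs CRC32C) changes the value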
$ hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum hdfs:///tmp/random-crctest*.dat
18/02/01 01:15:57 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.2-hadoop2
hdfs:///tmp/random-crctest-blocksize64mb.dat    COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-bpc1024.dat  COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-default1.dat COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-default2.dat COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-gzipcrc32-1.dat      COMPOSITE-CRC32 721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-gzipcrc32-2.dat      COMPOSITE-CRC32 721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-unaligned1.dat       COMPOSITE-CRC32C        c3842f6100000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-unaligned2.dat       COMPOSITE-CRC32C        c3842f6100000000000000000000000000000000000000000000000000000000

$ hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum hdfs:///tmpec/random-crctest*.dat
18/02/01 01:16:00 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.2-hadoop2
hdfs:///tmpec/random-crctest-blocksize64mb.dat  COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmpec/random-crctest-bpc1024.dat        COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmpec/random-crctest-default1.dat       COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmpec/random-crctest-default2.dat       COMPOSITE-CRC32C        4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmpec/random-crctest-gzipcrc32-1.dat    COMPOSITE-CRC32 721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmpec/random-crctest-gzipcrc32-2.dat    COMPOSITE-CRC32 721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmpec/random-crctest-unaligned1.dat     COMPOSITE-CRC32C        c3842f6100000000000000000000000000000000000000000000000000000000
hdfs:///tmpec/random-crctest-unaligned2.dat     COMPOSITE-CRC32C        c3842f6100000000000000000000000000000000000000000000000000000000

{code}
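
For context on why the COMPOSITE-CRC values above are layout-independent, here 
is a minimal, self-contained sketch of the CRC-composition math involved, 
ported from the well-known zlib crc32_combine construction. The class and 
method names are illustrative only and are not taken from the patch:

{code:java}
import java.util.zip.CRC32;

// Sketch: compose the CRCs of two byte ranges into the CRC of their
// concatenation without re-reading any data, via GF(2) matrix operators.
public class CrcCompositionSketch {

  // Multiply a 32x32 GF(2) matrix (one column per array entry) by a vector.
  private static long gf2MatrixTimes(long[] mat, long vec) {
    long sum = 0;
    for (int i = 0; vec != 0; vec >>>= 1, i++) {
      if ((vec & 1) != 0) {
        sum ^= mat[i];
      }
    }
    return sum;
  }

  // square = mat * mat over GF(2).
  private static void gf2MatrixSquare(long[] square, long[] mat) {
    for (int i = 0; i < 32; i++) {
      square[i] = gf2MatrixTimes(mat, mat[i]);
    }
  }

  // Returns crc(A || B) given crc1 = crc(A), crc2 = crc(B), len2 = |B| bytes.
  public static long crc32Combine(long crc1, long crc2, long len2) {
    if (len2 <= 0) {
      return crc1;
    }
    long[] even = new long[32]; // operator for an even power-of-two zero bits
    long[] odd = new long[32];  // operator for an odd power-of-two zero bits

    // Operator for one zero bit: the reflected CRC-32 polynomial.
    odd[0] = 0xEDB88320L;
    for (int i = 1; i < 32; i++) {
      odd[i] = 1L << (i - 1);
    }
    gf2MatrixSquare(even, odd); // two zero bits
    gf2MatrixSquare(odd, even); // four zero bits

    // Append len2 zero bytes to crc1 by repeated squaring of the operator;
    // the first squaring below yields the operator for one full zero byte.
    do {
      gf2MatrixSquare(even, odd);
      if ((len2 & 1) != 0) {
        crc1 = gf2MatrixTimes(even, crc1);
      }
      len2 >>= 1;
      if (len2 == 0) {
        break;
      }
      gf2MatrixSquare(odd, even);
      if ((len2 & 1) != 0) {
        crc1 = gf2MatrixTimes(odd, crc1);
      }
      len2 >>= 1;
    } while (len2 != 0);

    return crc1 ^ crc2; // composed CRC of the concatenation
  }

  public static void main(String[] args) {
    byte[] a = "first-chunk-".getBytes();
    byte[] b = "second-chunk".getBytes();
    CRC32 crcA = new CRC32();
    crcA.update(a);
    CRC32 crcB = new CRC32();
    crcB.update(b);
    CRC32 crcAB = new CRC32();
    crcAB.update(a);
    crcAB.update(b);
    // Both values printed below are identical.
    System.out.printf("direct=%08x composed=%08x%n",
        crcAB.getValue(), crc32Combine(crcA.getValue(), crcB.getValue(), b.length));
  }
}
{code}

Because composition only needs each chunk's CRC and its length, folding a 
file's per-chunk CRCs together in order gives the same value no matter how the 
bytes are split into chunks, blocks, or stripes, which is exactly why the 
COMPOSITE-CRC values above agree across layouts.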
 

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13056
>                 URL: https://issues.apache.org/jira/browse/HDFS-13056
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, distcp, erasure-coding, federation, hdfs
>    Affects Versions: 3.0.0
>            Reporter: Dennis Huo
>            Priority: Major
>         Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf
>
>
> FileChecksum was first introduced in 
> [https://issues-test.apache.org/jira/browse/HADOOP-3981] and ever since then 
> has remained defined as MD5-of-MD5-of-CRC, where per-512-byte chunk CRCs are 
> already stored as part of datanode metadata, and the MD5 approach is used to 
> compute an aggregate value in a distributed manner, with individual datanodes 
> computing the MD5-of-CRCs per-block in parallel, and the HDFS client 
> computing the second-level MD5.
>  
> An often-noted shortcoming of this approach is that the FileChecksum is 
> sensitive to the internal block-size and chunk-size configuration, so HDFS 
> files written with different block/chunk settings cannot be compared. More 
> commonly, different HDFS clusters may use different block sizes, in which 
> case data migration cannot use the FileChecksum for distcp's rsync 
> functionality or for verifying end-to-end data integrity (on top of the 
> low-level integrity checks applied at data-transfer time).
>  
> This was revisited in https://issues.apache.org/jira/browse/HDFS-8430 when 
> checksum support was added for striped erasure-coded files; while CRC 
> composability was discussed there, the hierarchical MD5 approach ultimately 
> prevailed, which adds the further problem that checksums of plain replicated 
> files are not comparable to those of striped files.
>  
> This feature proposes adding a "COMPOSITE-CRC" FileChecksum type that uses 
> CRC composition to remain completely chunk/block agnostic, allowing 
> comparison between striped and replicated files, between different HDFS 
> instances, and possibly even between HDFS and other external storage 
> systems. The feature can be added in place, staying compatible with existing 
> block metadata, and doesn't need to change the normal chunk-verification 
> path, so it is minimally invasive. This also means even large preexisting 
> HDFS deployments could adopt it to retroactively sync data. A detailed 
> design document can be found here: 
> https://storage.googleapis.com/dennishuo/hdfs-file-composite-crc32-v1.pdf


