[ https://issues.apache.org/jira/browse/HADOOP-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145849#comment-15145849 ]

Ming Ma commented on HADOOP-12326:
----------------------------------

Thanks [~jira.shegalov] for this useful feature.

* To compare a local file with an HDFS file, besides matching block sizes, it 
seems the bytes-per-checksum values of the local and HDFS files also need to 
match. For example, if an existing HDFS file was created with 
{{dfs.bytes-per-checksum}} set to 1024, then to check its local copy, the 
command line needs an option like {{hadoop fs -Dfile.bytes-per-checksum=1024 
-checksum file:///...}} so that a new crc file can be created and used if the 
existing crc file doesn't have the requested bytes-per-checksum value.
* HDFS supports variable-length blocks after 
https://issues.apache.org/jira/browse/HDFS-3689. So if an HDFS file has a 
partially filled block in the middle, the file checksum could differ from the 
local file checksum even when the contents are identical.

These aren't common scenarios, but I want to bring them up to make sure I 
understand them correctly. If so, we can either call out these scenarios in 
the doc or fix them, either in this jira or separately.
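To make the first scenario concrete, a comparison run might look like the 
following sketch (paths are illustrative, and I'm assuming the {{-D}} options 
are picked up by the {{-checksum}} command as in the patch's example):

{code}
# HDFS side: file originally written with dfs.bytes-per-checksum=1024
hadoop fs -checksum /user/mma/part-m-00000

# Local side: force the same chunk size, and a matching logical block size,
# so the composite checksums are comparable; a new .crc file would be created
# if the existing one uses a different bytes-per-checksum value
hadoop fs -Dfs.local.block.size=134217728 \
          -Dfile.bytes-per-checksum=1024 \
          -checksum file:///tmp/part-m-00000
{code}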

> Implement ChecksumFileSystem#getFileChecksum equivalent to HDFS for easy check
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-12326
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12326
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.7.1
>            Reporter: Gera Shegalov
>            Assignee: Gera Shegalov
>         Attachments: HADOOP-12326.001.patch, HADOOP-12326.002.patch, 
> HADOOP-12326.003.patch, HADOOP-12326.004.patch, HADOOP-12326.005.patch, 
> HADOOP-12326.007.patch
>
>
> If we have same-content files, one local and one remotely on HDFS (after 
> downloading or uploading), getFileChecksum can provide a quick check whether 
> they are consistent.  To this end, we can switch to CRC32C on local 
> filesystem. The difference in block sizes does not matter, because for the 
> local filesystem it's just a logical parameter.
> {code}
> $ hadoop fs -Dfs.local.block.size=134217728 -checksum 
> file:${PWD}/part-m-00000 part-m-00000
> 15/08/15 13:30:02 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> file:///Users/gshegalov/workspace/hadoop-common/part-m-00000  
> MD5-of-262144MD5-of-512CRC32C   
> 000002000000000000040000e84fb07f8c9d4ef3acb5d1983a7e2a68
> part-m-00000  MD5-of-262144MD5-of-512CRC32C   
> 000002000000000000040000e84fb07f8c9d4ef3acb5d1983a7e2a68
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
