[ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082130#comment-15082130 ]

Kai Zheng commented on HDFS-8430:
---------------------------------

Hi [~szetszwo], with a fresh mind I looked at your new algorithms again. I 
think the beauty of the cell-level checksum ({{New Algorithm 2}}) is that it 
solves the {{block size}} concern I mentioned before: it is guaranteed to 
produce the same file checksum as long as the {{cell size}} setting used by 
the striped file is also used for the replicated file. Given a striped file 
and a replicated file that contain the same data and use the same cell size, 
regardless of their block size settings, both can be divided into the same 
set of cells, hence the same cell checksums and the same file checksum. The 
remaining question is that the cells of the two files are laid out in 
different orders, yet even a linear code would require the same order for 
aggregation, so we must collect the cells together for either the replicated 
file or the striped file so it aligns with the other. As you said, we can do 
this on the client side or the DataNode side. IMO, the DataNode side sounds a 
little heavy, as it involves much effort for such a small function. Are you 
OK with doing it on the client side? Using CRC64, the network traffic would 
be smaller than with MD5 (which sends 16 bytes per checksum), and we would be 
adding a new API for this behavior.
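
To make the idea concrete, here is a minimal sketch of the cell-level 
aggregation as I understand it, assuming cells are checksummed in logical 
file order. The class and method names are mine, and I use CRC32 per cell 
only because the JDK has no built-in CRC64; the real implementation could use 
any per-cell checksum.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

public class CellChecksumSketch {

  /**
   * Reads the stream in logical file order, computes a CRC per cell, and
   * folds the per-cell CRCs into one MD5. Because cells are addressed by
   * logical offset, a striped file and a replicated file with identical
   * content and the same cell size yield the same digest, regardless of
   * block size or block layout.
   */
  public static byte[] fileChecksum(InputStream in, int cellSize)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] cell = new byte[cellSize];
    int read;
    while ((read = readCell(in, cell)) > 0) {
      CRC32 crc = new CRC32();                 // per-cell checksum
      crc.update(cell, 0, read);
      long v = crc.getValue();
      for (int i = 7; i >= 0; i--) {           // feed the CRC bytes to MD5
        md5.update((byte) (v >>> (8 * i)));
      }
    }
    return md5.digest();                       // file-level checksum
  }

  /** Fills buf as far as possible; returns the number of bytes read. */
  private static int readCell(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) break;                        // EOF
      off += n;
    }
    return off;
  }
}
{code}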

So to summarize:
* First, add a new API like {{getFileChecksum(int cell)}} using {{New 
Algorithm 2}}. With this API users can compare a replicated file with a 
striped file: if the file contents are the same, the file checksums will be 
the same. This version may incur more network traffic, since it needs to 
collect the cell checksums on the client side for the computation (see the 
interface sketch after this list).
* Second, still change the existing API {{getFileChecksum()}} (no args) for 
striped files, using an algorithm specific to striped files but similar to 
the existing one for replicated files. No CRC data will be collected 
centrally, so it avoids the extra network traffic that the new API incurs. 
Since the block layouts differ, the results will differ when comparing a 
striped file against a replicated file; this API can only compare two files 
of the same layout, either both replicated or both striped.
* {{distcp}} will be updated to choose between the two APIs appropriately.
* In this way, I guess we can make everybody happy?
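
To illustrate the proposal, here is a hypothetical shape for the two APIs; 
the signatures are illustrative only, not committed HDFS interfaces:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.Path;

/** Illustrative only; not a committed HDFS interface. */
public interface StripedFileChecksumApis {

  /**
   * New API: a layout-independent checksum built from cell-level checksums
   * (New Algorithm 2). Comparable between a replicated file and a striped
   * file as long as the same cell size is used, at the cost of shipping the
   * cell checksums to the client.
   */
  FileChecksum getFileChecksum(Path src, int cellSize) throws IOException;

  /**
   * Existing no-arg style API: block-layout-dependent and computed mostly
   * on the DataNodes, so no extra traffic, but only comparable between two
   * files with the same layout (both replicated or both striped).
   */
  FileChecksum getFileChecksum(Path src) throws IOException;
}
{code}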

Would you help clarify, correct or confirm? Thanks again!

> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduced a distributed file checksum algorithm. It's designed 
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so that it can work for 
> striped block groups.



