[ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082130#comment-15082130 ]
Kai Zheng commented on HDFS-8430:
---------------------------------
Hi [~szetszwo], with a fresh mind I looked at your new algorithms again. I
think the beauty of the cell-level checksum ({{New Algorithm 2}}) is that it
solves the {{block size}} concern I mentioned before: it is guaranteed to
generate the same file checksum result, provided the {{cell size}} setting
used by the striped file is also used for the replicated file. Given a striped
file and a replicated file that contain the same data and use the same cell
size, regardless of their block size settings, both can be divided into the
same set of cells, yielding the same cell checksums and therefore the same
file checksums. The remaining question is that the cells of the two files are
laid out in different orders, yet even a linear code would require the same
order for aggregation, so we must collect the cells of either the replicated
file or the striped file to align with the other. As you said, we can do this
on the client side or the DataNode side. IMO, doing it on the DataNode side
sounds a little heavy, as it involves much effort for such a small function.
Are you OK if we do it on the client side? Using CRC64 (8 bytes out per cell),
the network traffic would be smaller than with MD5 (16 bytes out). And we
would be adding a new API for this behavior.
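To make the idea concrete, here is a minimal sketch of the cell-level
computation, assuming client-side aggregation. This is not HDFS code: the
class name, the MD5-over-CRC folding, and the JDK's {{CRC32}} (standing in
for CRC64, which the JDK does not provide) are illustrative stand-ins.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

/**
 * Toy illustration of the cell-level checksum idea: the file is split
 * into fixed-size cells, a CRC is computed per cell, and the per-cell
 * CRCs are folded into one digest in logical file order.
 */
public class CellFileChecksum {

  /** Hypothetical helper; not an actual HDFS API. */
  public static byte[] checksum(InputStream in, int cellSize)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] cell = new byte[cellSize];
    int read;
    while ((read = readFully(in, cell)) > 0) {
      CRC32 crc = new CRC32();        // JDK CRC32; a real impl might use CRC64
      crc.update(cell, 0, read);      // the last cell may be partial
      long v = crc.getValue();
      for (int i = 7; i >= 0; i--) {  // feed the 8-byte CRC into the digest
        md5.update((byte) (v >>> (8 * i)));
      }
    }
    return md5.digest();              // same cells => same digest
  }

  private static int readFully(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) break;
      off += n;
    }
    return off;
  }
}
{code}
Because the digest depends only on the cell contents and their logical file
order, a replicated file and a striped file with identical data produce the
same result regardless of block size or block layout.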
So to summarize:
* First, add a new API like {{getFileChecksum(int cell)}} using {{New
Algorithm 2}} (see the API sketch after this list). With this API, users can
compare a replicated file with a striped file; if the file contents are the
same, the file checksums will be the same. This version may incur more network
traffic, as it needs to collect the cell checksums on the client side for the
computation.
* Second, still change the existing API {{getFileChecksum()}} (no args) for
striped files, using an algorithm specific to striped files but similar to the
existing one for replicated files. No CRC data is collected centrally, so it
avoids the extra network traffic that the new API incurs. Since the block
layouts differ, the results will differ if it is used to compare a striped
file against a replicated file; it can only be used to compare two files of
the same layout, either both replicated or both striped.
* {{distcp}} will be updated to use the two APIs appropriately.
* In this way, I guess it can make everybody happy?
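For concreteness, a hypothetical sketch of the two-API shape is below. The
interface name, the {{byte[]}} return type (standing in for
{{FileChecksum}}), and the chooser helper are placeholders, not committed
HDFS signatures.
{code:java}
import java.io.IOException;

/**
 * Hypothetical shape of the two checksum APIs discussed above.
 * These are placeholders, not committed HDFS signatures.
 */
interface FileChecksumApi {
  /** Existing-style, layout-dependent checksum: results are comparable
   *  only between two files with the same layout (both replicated or
   *  both striped). No central CRC collection, so cheap on the network. */
  byte[] getFileChecksum(String path) throws IOException;

  /** Proposed cell-based checksum: layout-independent given a common
   *  cell size, so a replicated and a striped file with identical
   *  contents compare equal. Heavier on network traffic, since cell
   *  checksums are aggregated on the client. */
  byte[] getFileChecksum(String path, int cellSize) throws IOException;
}

/** Illustrative chooser, mirroring how a tool like distcp might decide. */
class ChecksumChooser {
  static byte[] checksumFor(FileChecksumApi fs, String path,
      boolean layoutsDiffer, int cellSize) throws IOException {
    // A cross-layout comparison (replicated source vs. striped target)
    // needs the cell-based API; same-layout comparisons can use the
    // cheaper layout-dependent one.
    return layoutsDiffer
        ? fs.getFileChecksum(path, cellSize)
        : fs.getFileChecksum(path);
  }
}
{code}
The chooser reflects the division of labor above: {{distcp}} would pick the
cheaper layout-dependent API when source and target share a layout, and fall
back to the cell-based API otherwise.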
Would you help clarify, correct or confirm? Thanks again!
> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
> Key: HDFS-8430
> URL: https://issues.apache.org/jira/browse/HDFS-8430
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: HDFS-7285
> Reporter: Walter Su
> Assignee: Kai Zheng
> Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped
> block groups.