[
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870992#comment-15870992
]
Kai Zheng commented on HDFS-8430:
---------------------------------
Hi Andrew,
Sorry for the late response.
Quite some time ago [~szetszwo] and I worked out two approaches for this through
a long discussion:
{quote}
First, add a new API like getFileChecksum(int cell) using the New Algorithm 2.
With this API users can compare a replicated file with a striped file: if the
file contents are the same, the file checksums will be the same. This version
may incur larger network traffic, as it needs to collect cells on the client
side for the computation.
Second, change the existing API getFileChecksum() (no args) to also handle
striped files, using an algorithm specific to striped files but similar to the
existing one for replicated files. No CRC data is collected centrally, so it
avoids the larger network traffic that the new API incurs. As the block layouts
are different, the results will differ if it is used to compare a striped file
against a replicated file. It can be used to compare two files of the same
layout, either replicated or striped.
{quote}
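To make the first approach above concrete, here is a rough sketch of the API
surface it implies; the interface and parameter names are purely illustrative,
not an agreed design:
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative only: one possible shape for the 1st approach. Computing the
 * checksum cell by cell (rather than per block) would make the result
 * independent of whether a file is replicated or striped, at the cost of
 * pulling per-cell CRC data back to the client for aggregation.
 */
public interface CellLevelChecksum {
  FileChecksum getFileChecksum(Path file, int bytesPerCell) throws IOException;
}
{code}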
Sub-tasks of this, HDFS-9694 and HDFS-9833, implemented the {{2nd}} approach,
enhancing the existing API getFileChecksum() (no args) to support striped
files. It can be used to compare two files of the same layout, either
replicated or striped. I think this is good enough for now, for example for
the distcp use case.
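For instance, with the {{2nd}} approach in place, two files of the same layout
can be compared roughly like this (the paths below are just placeholders):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareChecksums {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Placeholder paths; both files must have the same layout
    // (both replicated, or both striped) for the comparison to be meaningful.
    FileChecksum source = fs.getFileChecksum(new Path("/src/file"));
    FileChecksum target = fs.getFileChecksum(new Path("/dst/file"));

    // FileChecksum equality compares algorithm name, length and checksum bytes.
    System.out.println("Checksums match: " + source.equals(target));
  }
}
{code}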
The {{1st}} approach can be used to compare a replicated file against a striped
file. It needs non-trivial development work and also involves heavy network
traffic, since an aggregate checksum result has to be computed centrally for
each block group. IMO, we could continue to defer this until there is an
explicit user requirement for the target behavior ({{compare a replicated file
against a striped file}}).
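Just to illustrate where the extra traffic comes from: with the block-level
algorithm the datanodes return only one digest per block, whereas a cell-level
aggregation would need every cell's CRC data at the client, roughly along
these lines (a sketch only, not the actual HDFS code):
{code:java}
import java.security.MessageDigest;
import java.util.List;

/** Sketch only: client-side aggregation of per-cell CRC data. */
public final class CellChecksumAggregator {

  /**
   * Digests per-cell CRC bytes in logical file order. Every cell's CRC data
   * has to travel to the client, which is where the extra network traffic
   * comes from compared to the per-block digests used today.
   */
  public static byte[] aggregate(List<byte[]> cellCrcsInLogicalOrder)
      throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (byte[] cellCrc : cellCrcsInLogicalOrder) {
      md5.update(cellCrc);
    }
    return md5.digest();
  }
}
{code}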
So what are your thoughts?
> Erasure coding: compute file checksum for striped files (stripe by stripe)
> --------------------------------------------------------------------------
>
> Key: HDFS-8430
> URL: https://issues.apache.org/jira/browse/HDFS-8430
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: erasure-coding
> Affects Versions: HDFS-7285
> Reporter: Walter Su
> Assignee: Kai Zheng
> Priority: Blocker
> Labels: hdfs-ec-3.0-must-do
> Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped
> block groups.