[
https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073535#comment-15073535
]
Walter Su commented on HDFS-8430:
---------------------------------
The MD5-of-xxxMD5-of-yyyCRC32 checksum is computed from existing block
metadata, so its value depends on bytes.per.checksum and block size.
Now that we have the EC feature, it depends on bytes.per.checksum, block size,
and block layout.
bytes.per.checksum is unlikely to change across version upgrades. I think the
2 clusters involved in a DistCp probably have the same bpc.
If we want the checksum to be unaffected by block size and block layout, the
client should fetch the CRCs from the DNs and compute the file checksum on the
client side. That changes the existing implementation and increases network
traffic, but if it lets us avoid transferring files between clusters, I think
we can bear the cost.
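To illustrate the dependency described above, here is a minimal sketch in plain Java. It is a simplified model, not the actual DFSClient or DataNode code: the class name ChecksumSketch and all of its methods are hypothetical, the file is held in memory, and the block size is assumed to be a multiple of bytes.per.checksum. The block-based composite hashes the chunk CRCs per block and then hashes the per-block digests, so regrouping the same CRCs under a different block size changes the result. The client-side variant digests all chunk CRCs in file order, so it depends only on bytes.per.checksum.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical, simplified model of the two checksum strategies.
final class ChecksumSketch {

    // CRC32 of each bytesPerChecksum-sized chunk, in file order.
    public static int[] chunkCrcs(byte[] data, int bytesPerChecksum) {
        int n = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        int[] crcs = new int[n];
        for (int i = 0; i < n; i++) {
            int off = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            crcs[i] = (int) crc.getValue();
        }
        return crcs;
    }

    public static byte[] md5(byte[] in) {
        try {
            return MessageDigest.getInstance("MD5").digest(in);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialize crcs[from, to) as big-endian 4-byte values.
    public static byte[] toBytes(int[] crcs, int from, int to) {
        ByteBuffer buf = ByteBuffer.allocate(4 * (to - from));
        for (int i = from; i < to; i++) {
            buf.putInt(crcs[i]);
        }
        return buf.array();
    }

    // Block-based composite: MD5 over the per-block MD5s of chunk CRCs.
    // Assumes blockSize is a multiple of bytesPerChecksum.
    public static byte[] blockComposite(byte[] data, int bytesPerChecksum, int blockSize) {
        int[] crcs = chunkCrcs(data, bytesPerChecksum);
        int chunksPerBlock = blockSize / bytesPerChecksum;
        ByteArrayOutputStream blockMd5s = new ByteArrayOutputStream();
        for (int from = 0; from < crcs.length; from += chunksPerBlock) {
            int to = Math.min(from + chunksPerBlock, crcs.length);
            byte[] blockMd5 = md5(toBytes(crcs, from, to));
            blockMd5s.write(blockMd5, 0, blockMd5.length);
        }
        return md5(blockMd5s.toByteArray());
    }

    // Client-side variant: one digest over all chunk CRCs in file order.
    // Block size and block layout never enter the computation.
    public static byte[] clientSide(byte[] data, int bytesPerChecksum) {
        int[] crcs = chunkCrcs(data, bytesPerChecksum);
        return md5(toBytes(crcs, 0, crcs.length));
    }
}
```

In this model the client-side variant needs every chunk CRC (4 bytes per bytes.per.checksum of data) to reach the client, whereas the block-based composite only needs one digest per block. That is the extra network traffic mentioned above.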
I haven't started. Please feel free to take it.
> Erasure coding: update DFSClient.getFileChecksum() logic for stripe files
> -------------------------------------------------------------------------
>
> Key: HDFS-8430
> URL: https://issues.apache.org/jira/browse/HDFS-8430
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: HDFS-7285
> Reporter: Walter Su
> Assignee: Walter Su
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed
> for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so that it works for
> striped block groups.