[
https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629237#action_12629237
]
Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------
> Perhaps we can use bytes.per.checksum in the algorithm name, e.g.,
> MD5-of-CRC32-every-512bytes
+1. We definitely need to encode these details in the algorithm name.
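For example, such a name could be built mechanically from the parameters. A minimal sketch (the helper below is hypothetical, not an existing Hadoop API):
{code}
// Sketch: encode checksum parameters into the algorithm name as suggested
// above. The class and method here are illustrative only.
public class ChecksumAlgorithmName {
  /** e.g. makeName("MD5", "CRC32", 512) -> "MD5-of-CRC32-every-512bytes" */
  static String makeName(String outer, String inner, int bytesPerChecksum) {
    return outer + "-of-" + inner + "-every-" + bytesPerChecksum + "bytes";
  }

  public static void main(String[] args) {
    System.out.println(makeName("MD5", "CRC32", 512));
  }
}
{code}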
> Another API to consider ...
Which API are you talking about, the FileSystem API or the HDFS API? If you
mean the HDFS API, are you saying that we should handle HDFS specially in
DistCp? Currently, DistCp uses only the FileSystem API.
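For reference, this is the kind of FileSystem-level check DistCp could make to decide whether a copy is needed, assuming a FileSystem.getFileChecksum(Path) method along the lines this issue proposes; the exact signature and the FileChecksum type are assumptions here, not a final API:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCompare {
  // True only when both file systems return comparable, equal checksums.
  // getFileChecksum is the call this issue is working toward; its exact
  // shape is an assumption in this sketch.
  static boolean sameChecksum(FileSystem srcFs, Path src,
                              FileSystem dstFs, Path dst) throws IOException {
    FileChecksum a = srcFs.getFileChecksum(src);
    FileChecksum b = dstFs.getFileChecksum(dst);
    // A null result would mean the file system cannot compute a checksum;
    // a copier should then fall back to copying rather than assume equality.
    return a != null && a.equals(b);
  }
}
{code}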
> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>
> Key: HADOOP-3981
> URL: https://issues.apache.org/jira/browse/HADOOP-3981
> Project: Hadoop Core
> Issue Type: New Feature
> Components: dfs
> Reporter: Tsz Wo (Nicholas), SZE
>
> Traditional message digest algorithms, such as MD5 and SHA-1, require reading
> the entire input message sequentially in a central location. HDFS supports
> large files of multiple terabytes, so the overhead of reading an entire file
> in one place is huge. A distributed file checksum algorithm is needed for HDFS.
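To make the idea concrete, here is a standalone sketch of an "MD5-of-CRC32" digest: CRC32 each fixed-size chunk (work that can be farmed out to the datanodes holding each block), then MD5 the small stream of per-chunk CRCs in one place. The class name, the 512-byte default implied above, and the 4-byte big-endian CRC encoding are all illustrative choices:
{code}
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class Md5OfCrc32 {
  // Fill buf as far as possible; returns the number of bytes read (0 at EOF).
  static int readFully(InputStream in, byte[] buf) throws Exception {
    int off = 0;
    int n;
    while (off < buf.length && (n = in.read(buf, off, buf.length - off)) > 0) {
      off += n;
    }
    return off;
  }

  // MD5 over the sequence of per-chunk CRC32 values. Each chunk's CRC can
  // be computed independently and in parallel; only the short CRC stream
  // need be digested centrally, avoiding a full sequential read of the file.
  static byte[] digest(InputStream in, int bytesPerChecksum) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] chunk = new byte[bytesPerChecksum];
    int n;
    while ((n = readFully(in, chunk)) > 0) {
      CRC32 crc = new CRC32();
      crc.update(chunk, 0, n);                       // per-chunk CRC32
      md5.update(ByteBuffer.allocate(4)              // feed 4-byte CRC to MD5
          .putInt((int) crc.getValue()).array());
    }
    return md5.digest();                             // MD5 over all chunk CRCs
  }
}
{code}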