[
https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627146#action_12627146
]
Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------
> Distcp should not hardwire any algorithm
That is true. We might need a method for getting the supported algorithms of a
file system, listed in order of preference. For example, if S3 supports
{MD5, FileLength}, HDFS supports {HDFS-Checksum, FileLength}, and
LocalFS supports {MD5, HDFS-Checksum, FileLength}, then
- S3 -> HDFS or HDFS -> S3 will use FileLength
- S3 -> S3 will use MD5
- S3 -> LocalFS will use MD5
- LocalFS -> HDFS will use HDFS-Checksum
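The selection rule above can be sketched as: walk the source file system's preference-ordered list and pick the first algorithm the destination also supports. This is only an illustration of the comment's examples; the class and method names are hypothetical, not part of any Hadoop API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: choose the checksum algorithm for a copy by taking
// the first entry in the source's preference-ordered list that the
// destination file system also supports.
public class ChecksumNegotiation {
    static String pickAlgorithm(List<String> srcPrefs, List<String> dstSupported) {
        for (String algo : srcPrefs) {
            if (dstSupported.contains(algo)) {
                return algo;            // first common algorithm wins
            }
        }
        return null;                    // no common algorithm: fall back to reading bytes
    }

    public static void main(String[] args) {
        List<String> s3 = Arrays.asList("MD5", "FileLength");
        List<String> hdfs = Arrays.asList("HDFS-Checksum", "FileLength");
        List<String> local = Arrays.asList("MD5", "HDFS-Checksum", "FileLength");

        System.out.println(pickAlgorithm(s3, hdfs));    // FileLength
        System.out.println(pickAlgorithm(s3, s3));      // MD5
        System.out.println(pickAlgorithm(s3, local));   // MD5
        System.out.println(pickAlgorithm(local, hdfs)); // HDFS-Checksum
    }
}
```

This reproduces all four pairings from the list above.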
> Extend FileSystem API to return file-checksums/file-digests
> -----------------------------------------------------------
>
> Key: HADOOP-3941
> URL: https://issues.apache.org/jira/browse/HADOOP-3941
> Project: Hadoop Core
> Issue Type: New Feature
> Components: fs
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Attachments: 3941_20080818.patch, 3941_20080819.patch,
> 3941_20080819b.patch, 3941_20080820.patch, 3941_20080826.patch,
> 3941_20080827.patch
>
>
> Suppose we have two files in two locations (possibly two clusters) and these
> two files have the same size. How can we tell whether their contents are the
> same?
> Currently, the only way is to read both files and compare their contents,
> which is a very expensive operation if the files are huge.
> So, we would like to extend the FileSystem API to support returning
> file-checksums/file-digests.
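The idea behind the feature can be illustrated as follows: each side computes a small digest of its file, and only the digests are compared, instead of shipping and comparing full file contents. This is a hypothetical sketch of the concept, not the actual patch or FileSystem API; MD5 stands in for whatever algorithm the two file systems agree on.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration: comparing digests costs a few bytes regardless
// of file size, whereas comparing contents costs a full read of both files.
public class FileDigest {
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is guaranteed by the JDK", e);
        }
    }

    // Equal digests indicate (with overwhelming probability) equal contents.
    static boolean sameDigest(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        byte[] d1 = md5("same content".getBytes());
        byte[] d2 = md5("same content".getBytes());
        byte[] d3 = md5("different content".getBytes());
        System.out.println(sameDigest(d1, d2)); // true
        System.out.println(sameDigest(d1, d3)); // false
    }
}
```

A FileSystem-level method returning such a digest would let a client compare two remote files without reading either one.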
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.