[ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626614#action_12626614 ]
Doug Cutting commented on HADOOP-3941:
--------------------------------------
I don't see the point in passing the checksum algorithm name to
getFileChecksum(). Do we expect a FileSystem to actually checksum a file on
demand? I assume not: this feature is primarily for accessing pre-computed
checksums, and most filesystems will support only a single checksum
algorithm.
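To make that concrete, here is a minimal sketch (in Java) of the shape I have
in mind. The names are illustrative, not the committed interface, and
ChecksumSource merely stands in for the method that would live on FileSystem:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch, not the committed code: the checksum object
    // reports which algorithm produced it, so callers never pass one in.
    abstract class FileChecksum {
      public abstract String getAlgorithmName(); // algorithm that produced it
      public abstract int getLength();           // checksum size in bytes
      public abstract byte[] getBytes();         // the checksum value
    }

    // FileSystem would then expose a method with no algorithm argument;
    // null would mean the filesystem keeps no pre-computed checksum.
    interface ChecksumSource {
      FileChecksum getFileChecksum(Path f) throws IOException;
    }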
There are two primary cases to consider:
1. Copying files between filesystems that have pre-computed checksums using
the same algorithm.
2. Copying files between filesystems which either do not have pre-computed
checksums or use different algorithms.
In (2), copies should use file lengths or perhaps fail; in (1), we should use
checksums. Right?
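To illustrate, a rough sketch of how a copy tool like distcp could handle both
cases, assuming the getFileChecksum()/FileChecksum API from the attached
patches (sameFile here is a hypothetical helper, not part of any patch):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class CopyVerifier {
      // Hypothetical helper: decide whether src and dst match.
      static boolean sameFile(FileSystem srcFs, Path src,
                              FileSystem dstFs, Path dst) throws IOException {
        FileChecksum srcSum = srcFs.getFileChecksum(src);
        FileChecksum dstSum = dstFs.getFileChecksum(dst);
        // Case (1): both sides have pre-computed checksums from the same
        // algorithm -- compare them directly.
        if (srcSum != null && dstSum != null
            && srcSum.getAlgorithmName().equals(dstSum.getAlgorithmName())) {
          return srcSum.equals(dstSum);
        }
        // Case (2): checksums unavailable or algorithms differ -- fall
        // back to comparing lengths (or the tool may choose to fail).
        return srcFs.getFileStatus(src).getLen()
            == dstFs.getFileStatus(dst).getLen();
      }
    }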
In any case, hardwiring distcp to use FileLengthChecksum doesn't seem like an
improvement.
> Extend FileSystem API to return file-checksums/file-digests
> -----------------------------------------------------------
>
> Key: HADOOP-3941
> URL: https://issues.apache.org/jira/browse/HADOOP-3941
> Project: Hadoop Core
> Issue Type: New Feature
> Components: fs
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Attachments: 3941_20080818.patch, 3941_20080819.patch,
> 3941_20080819b.patch, 3941_20080820.patch, 3941_20080826.patch,
> 3941_20080827.patch
>
>
> Suppose we have two files in two locations (possibly two clusters) and these
> two files have the same size. How could we tell whether their contents are
> the same?
> Currently, the only way is to read both files and compare their contents.
> This is a very expensive operation if the files are huge.
> So, we would like to extend the FileSystem API to support returning
> file-checksums/file-digests.