[
https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626699#action_12626699
]
Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------
> Do we expect a FileSystem to actually checksum a file on demand? I assume
> not, that this feature is primarily for accessing pre-computed checksums, ...
For HDFS, I am not sure whether sending all CRCs to the client is good enough,
since the CRCs total 1/128 of the file size (one 4-byte CRC per 512-byte
chunk), which is large for big files. We might want to reduce the network
traffic (especially in the case of distcp) by computing a second level of
checksums (e.g. computing an MD5 over all the CRCs of a block). So, I think
this feature is not only for accessing pre-computed checksums, but indeed a
framework for supporting checksum algorithms.
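The second-level idea can be sketched in plain Java (this is only an illustration, not the actual HDFS implementation; the class and method names are made up): compute a CRC32 for each 512-byte chunk, then digest the concatenated CRCs with MD5, so the result stays 16 bytes no matter how large the file is.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Illustrative sketch: per-chunk CRC32s folded into a single MD5 digest.
public class CompositeChecksum {
    static final int BYTES_PER_CRC = 512;  // HDFS default chunk size

    public static byte[] md5OfCrcs(byte[] data) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (int off = 0; off < data.length; off += BYTES_PER_CRC) {
                int len = Math.min(BYTES_PER_CRC, data.length - off);
                CRC32 crc = new CRC32();
                crc.update(data, off, len);
                long v = crc.getValue();
                // feed the 4 CRC bytes (big-endian) into the digest
                md5.update(new byte[] {
                    (byte) (v >>> 24), (byte) (v >>> 16),
                    (byte) (v >>> 8), (byte) v });
            }
            return md5.digest();  // always 16 bytes, regardless of file size
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }
}
```

Only the 16-byte digest would cross the network per block, instead of 1/128 of the block's data.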
> In (2) copies should use file lengths or perhaps fail, ...
It should not fail. Otherwise, we cannot copy from the local fs to HDFS. We
are currently using the file length as the checksum, and it is simply too
easy to get false positives.
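The false-positive risk is easy to demonstrate (a toy sketch with made-up names, not DistCp's actual comparison code): two files of identical length but different content compare as "equal" by length, while any content-based comparison catches the difference.

```java
import java.util.Arrays;

// Toy illustration of why file length alone is a weak checksum.
public class LengthFalsePositive {
    // Length comparison: cheap, but blind to content.
    public static boolean sameByLength(byte[] f1, byte[] f2) {
        return f1.length == f2.length;
    }

    // Content comparison: what a real checksum approximates.
    public static boolean sameByContent(byte[] f1, byte[] f2) {
        return Arrays.equals(f1, f2);
    }
}
```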
> In any case, hardwiring distcp to use FileLengthChecksum doesn't seem like an
> improvement.
It is only temporary. Once we have a distributed checksum implementation, we
could change DistCp to use it. The distributed checksum implementation will
be optimized for HDFS, so that copying from HDFS to HDFS will be very
efficient (which is the main purpose of distcp). If necessary, we could
provide an option in distcp for users to specify the checksum algorithm.
> Extend FileSystem API to return file-checksums/file-digests
> -----------------------------------------------------------
>
> Key: HADOOP-3941
> URL: https://issues.apache.org/jira/browse/HADOOP-3941
> Project: Hadoop Core
> Issue Type: New Feature
> Components: fs
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Attachments: 3941_20080818.patch, 3941_20080819.patch,
> 3941_20080819b.patch, 3941_20080820.patch, 3941_20080826.patch,
> 3941_20080827.patch
>
>
> Suppose we have two files in two locations (possibly two clusters) and
> these two files have the same size. How could we tell whether their
> contents are the same?
> Currently, the only way is to read both files and compare their contents.
> This is a very expensive operation if the files are huge.
> So, we would like to extend the FileSystem API to support returning
> file-checksums/file-digests.