[
https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626699#action_12626699
]
Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------
> Do we expect a FileSystem to actually checksum a file on demand? I assume
> not, that this feature is primarily for accessing pre-computed checksums, ...
For HDFS, I am not sure whether sending all CRCs to the client is good enough,
since the CRCs total 1/128 of the file size (one 4-byte CRC per 512-byte
chunk), which is large for big files. We might want to reduce the network
traffic (especially in the case of distcp) by computing a second level of
checksums (e.g. computing an MD5 over all the CRCs of a block). So, I think
this feature is not only for accessing pre-computed checksums, but indeed a
framework for supporting checksum algorithms.
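The second-level idea can be sketched in plain Java (this is only an illustration, not the actual HDFS implementation; the class and method names are made up): compute a CRC32 for each 512-byte chunk, then digest the concatenated CRCs with MD5, so the result stays 16 bytes no matter how large the file is.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Illustrative sketch: per-chunk CRC32s folded into a single MD5 digest.
public class CompositeChecksum {
    static final int BYTES_PER_CRC = 512;  // HDFS default chunk size

    public static byte[] md5OfCrcs(byte[] data) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (int off = 0; off < data.length; off += BYTES_PER_CRC) {
                int len = Math.min(BYTES_PER_CRC, data.length - off);
                CRC32 crc = new CRC32();
                crc.update(data, off, len);
                long v = crc.getValue();
                // feed the 4 CRC bytes (big-endian) into the digest
                md5.update(new byte[] {
                    (byte) (v >>> 24), (byte) (v >>> 16),
                    (byte) (v >>> 8), (byte) v });
            }
            return md5.digest();  // always 16 bytes, regardless of file size
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }
}
```

Only the 16-byte digest would cross the network per block, instead of 1/128 of the block's data.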
> In (2) copies should use file lengths or perhaps fail, ...
It should not fail. Otherwise, we cannot copy from the local fs to HDFS. We
are currently using the file length as the checksum, and it is simply too
easy to get false positives.
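The false-positive risk is easy to demonstrate (a toy sketch with made-up names, not DistCp's actual comparison code): two files of identical length but different content compare as "equal" by length, while any content-based comparison catches the difference.

```java
import java.util.Arrays;

// Toy illustration of why file length alone is a weak checksum.
public class LengthFalsePositive {
    // Length comparison: cheap, but blind to content.
    public static boolean sameByLength(byte[] f1, byte[] f2) {
        return f1.length == f2.length;
    }

    // Content comparison: what a real checksum approximates.
    public static boolean sameByContent(byte[] f1, byte[] f2) {
        return Arrays.equals(f1, f2);
    }
}
```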
> In any case, hardwiring distcp to use FileLengthChecksum doesn't seem like an
> improvement.
It is only temporary. Once we have a distributed checksum implementation, we
could change DistCp to use it. The distributed checksum implementation will
be optimized for HDFS, so that copying from HDFS to HDFS will be very
efficient (which is the main purpose of distcp). If necessary, we could
provide an option in distcp for users to specify the checksum algorithm.
> Extend FileSystem API to return file-checksums/file-digests
> -----------------------------------------------------------
>
> Key: HADOOP-3941
> URL: https://issues.apache.org/jira/browse/HADOOP-3941
> Project: Hadoop Core
> Issue Type: New Feature
> Components: fs
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Attachments: 3941_20080818.patch, 3941_20080819.patch,
> 3941_20080819b.patch, 3941_20080820.patch, 3941_20080826.patch,
> 3941_20080827.patch
>
>
> Suppose we have two files in two locations (possibly two clusters) and
> these two files have the same size. How could we tell whether their
> contents are the same?
> Currently, the only way is to read both files and compare their contents.
> This is a very expensive operation if the files are huge.
> So, we would like to extend the FileSystem API to support returning
> file-checksums/file-digests.