I actually had to pull this code out for a project about two years ago
(we had to two-hop the files due to some security issues, and the sender
wanted to know whether the file arrived in HDFS "intact").

I've pulled out the code into a side project:

https://github.com/jpatanooga/IvoryMonkey

more specifically:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java

What it's actually doing is storing the CRC32 of every 512 bytes of each
block and then taking an MD5 hash of those CRCs for the block. When the
getFileChecksum() method is called, each block of the file sends its MD5
hash to a collector, the block hashes are gathered together, and an MD5
hash is calculated over all of the block hashes.

My version includes code that can calculate the hash on the client side:
it breaks the data up the same way HDFS does and calculates the digests
the same way, so the two results are directly comparable. A rough sketch
of the idea is below.
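
To make the scheme concrete, here is a minimal client-side sketch in plain
Java. This is not the IvoryMonkey code itself; it assumes 512-byte checksum
chunks and 64 MB blocks (the usual defaults), and those values have to match
whatever the file was actually written with or the digests won't line up.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

// Sketch of the "MD5 of per-block MD5s of per-chunk CRC32s" scheme, computed
// entirely on the client side. BYTES_PER_CRC and BLOCK_SIZE are assumptions
// (io.bytes.per.checksum and dfs.block.size) and must match the values the
// file was written with.
public class HdfsStyleChecksumSketch {

    static final int BYTES_PER_CRC = 512;
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    public static byte[] checksum(InputStream in) throws Exception {
        MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
        byte[] chunk = new byte[BYTES_PER_CRC];
        boolean eof = false;

        while (!eof) {
            // Per block: MD5 over the big-endian CRC32 of each 512-byte chunk.
            MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
            long blockBytes = 0;
            boolean blockHasData = false;

            while (blockBytes < BLOCK_SIZE) {
                int n = readFully(in, chunk);
                if (n <= 0) { eof = true; break; }
                blockHasData = true;
                CRC32 crc = new CRC32();
                crc.update(chunk, 0, n);
                blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
                blockBytes += n;
                if (n < BYTES_PER_CRC) { eof = true; break; }  // short read => end of file
            }
            if (blockHasData) {
                // The file-level MD5 is taken over the concatenated block MD5s.
                fileMd5.update(blockMd5.digest());
            }
        }
        return fileMd5.digest();
    }

    // Read until the buffer is full or EOF; returns the number of bytes read.
    private static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
        }
        return off;
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream(args[0]);
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : checksum(in)) hex.append(String.format("%02x", b & 0xff));
            System.out.println(hex);
        } finally {
            in.close();
        }
    }
}

In principle the printed digest should correspond to the MD5 portion of the
MD5-of-xMD5-of-yCRC32 string that getFileChecksum() returns, but treat it as
illustration rather than a drop-in replacement for the real code.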

During development, we also discovered and filed:

https://issues.apache.org/jira/browse/HDFS-772

To invoke this method, use my shell wrapper:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/Shell.java

Josh



On Wed, Mar 2, 2011 at 1:00 PM, Philip Zeyliger <phi...@cloudera.com> wrote:
> The FileSystem API exposes a getFileChecksum() method
>  http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path) too.
>  However, this isn't the straight-up MD5 of the file.
> [1]monster01::groovy-1.7.8(12813)$CLASSPATH=$(hadoop classpath) bin/groovysh
> groovy:000> import org.apache.hadoop.fs.FileSystem
> ===> [import org.apache.hadoop.fs.FileSystem]
> groovy:000> import org.apache.hadoop.conf.Configuration
> ===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration]
> groovy:000> import org.apache.hadoop.fs.Path
> ===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration, import org.apache.hadoop.fs.Path]
> groovy:000> fs = FileSystem.get(new Configuration())
> ===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
> groovy:000> fs
> ===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
> groovy:000> fs.getFileChecksum(new Path("/tmp/issue"))
> ===> MD5-of-0MD5-of-512CRC32:eeec9870219b2381f99ac8ea0c2d0d60
> groovy:000>
> Whereas:
> [0]monster01::~(12845)$hadoop fs -put /etc/issue /tmp/issue
> [1]monster01::~(12844)$md5sum /etc/issue
> 6c9222ee501323045d85545853ebea55  /etc/issue
>
>
> On Wed, Mar 2, 2011 at 9:47 AM, <stu24m...@yahoo.com> wrote:
>>
>> I don't think there is a built-in command. I would just use the Java or
>> Thrift API to read the file & calculate the hash. (Thrift + Python/Ruby/etc.)
>>
>> Take care,
>>  -stu
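
(Interjecting here: for the straight-up md5sum-equivalent Scott asks about
below, the "read the file and hash it" approach looks roughly like the sketch
that follows. This is untested example code using the plain Java FileSystem
API, not part of IvoryMonkey; the hdfs:// URI on the command line is just a
placeholder.)

import java.net.URI;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Stream a file out of HDFS and compute the plain MD5 of its contents, so
// the hex digest can be compared 1:1 with md5sum run on a local copy.
public class HdfsMd5 {
    public static void main(String[] args) throws Exception {
        // args[0] is a full URI, e.g. hdfs://namenode/path/to/file (placeholder).
        FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());
        MessageDigest md5 = MessageDigest.getInstance("MD5");

        FSDataInputStream in = fs.open(new Path(args[0]));
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                md5.update(buf, 0, n);
            }
        } finally {
            in.close();
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        // Same "digest  filename" shape as md5sum output.
        System.out.println(hex + "  " + args[0]);
    }
}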
>> -----Original Message-----
>> From: Scott Golby <sgo...@conductor.com>
>> Date: Wed, 2 Mar 2011 11:05:04
>> To: hdfs-user@hadoop.apache.org<hdfs-user@hadoop.apache.org>
>> Reply-To: hdfs-user@hadoop.apache.org
>> Subject: md5sum of files on HDFS ?
>>
>> Hi Everyone,
>>
>> How can I do an md5sum/sha1sum directly against files on HDFS?
>>
>>
>> A pretty common thing I do when archiving files is make an md5sum list
>>
>> eg)  md5sum /archive/path/* > md5sum-list.txt
>>
>> Then later, should I need to check that the files are OK, perhaps before a
>> restore or when I copy them somewhere else, I'll do
>> md5sum -c md5sum-list.txt
>>
>>
>> I'd be OK doing it one file at a time:
>>
>> java -jar <something> hdfs://some/path/in-hadoop/filename
>>
>>
>> I'm also OK doing it serially through a single node. I've been doing some
>> googling and JIRA ticket reading, such as
>> https://issues.apache.org/jira/browse/HADOOP-3981, and for my use case
>> serial read is not a limitation.
>>
>> What is a bit of a requirement is output I can compare 1:1 against what a
>> standard Linux command produces on local disk.
>> eg) Check the HDFS md5sum of a file, copyToLocal, re-check the md5sum on local disk.
>>
>> Thanks,
>> Scott
>>
>
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv
