The FileSystem API exposes a getFileChecksum() method too:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)

However, this isn't the straight-up MD5 of the file's contents: it's a composite checksum (an MD5 of per-block MD5s of CRC32s computed over 512-byte chunks, hence the "MD5-of-0MD5-of-512CRC32" label below), so it will never match what md5sum prints for the same bytes.
[1]monster01::groovy-1.7.8(12813)$CLASSPATH=$(hadoop classpath) bin/groovysh
groovy:000> import org.apache.hadoop.fs.FileSystem
===> [import org.apache.hadoop.fs.FileSystem]
groovy:000> import org.apache.hadoop.conf.Configuration
===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration]
groovy:000> import org.apache.hadoop.fs.Path
===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration, import org.apache.hadoop.fs.Path]
groovy:000> fs = FileSystem.get(new Configuration())
===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
groovy:000> fs
===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
groovy:000> fs.getFileChecksum(new Path("/tmp/issue"))
===> MD5-of-0MD5-of-512CRC32:eeec9870219b2381f99ac8ea0c2d0d60
groovy:000>

Whereas:

[0]monster01::~(12845)$hadoop fs -put /etc/issue /tmp/issue
[1]monster01::~(12844)$md5sum /etc/issue
6c9222ee501323045d85545853ebea55  /etc/issue

On Wed, Mar 2, 2011 at 9:47 AM, <stu24m...@yahoo.com> wrote:
> I don't think there is a built-in command. I would just use the java or
> thrift api to read the file & calculate the hash. (thrift + python/ruby/etc)
>
> Take care,
> -stu
>
> -----Original Message-----
> From: Scott Golby <sgo...@conductor.com>
> Date: Wed, 2 Mar 2011 11:05:04
> To: hdfs-user@hadoop.apache.org
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: md5sum of files on HDFS ?
>
> Hi Everyone,
>
> How can I do an md5sum/sha1sum directly against files on HDFS?
>
> A pretty common thing I do when archiving files is make an md5sum list,
> e.g.) md5sum /archive/path/* > md5sum-list.txt
>
> Then later, should I need to check the files are OK, perhaps before a
> restore or when I copy them somewhere else, I'll do
> md5sum -c md5sum-list.txt
>
> I'd be OK doing it one file at a time:
> java -jar <something> hdfs://some/path/in-hadoop/filename
>
> I'm also OK doing it serially through a single node. I've been doing some
> googling and JIRA ticket reading, such as
> https://issues.apache.org/jira/browse/HADOOP-3981, and for my use case
> serial read is not a limitation.
>
> What is a bit of a requirement is something I can run as a standard Linux
> command on local disk and do 1:1 output comparison,
> e.g.) check the HDFS md5sum of a file, copyToLocal, re-check the local-disk md5sum.
>
> Thanks,
> Scott
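Following stu's suggestion above of reading the file through the Java API and hashing it client-side, here is a minimal sketch that prints its result in md5sum's own format. The class name, jar name, and invocation are illustrative, not an existing tool:

import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative utility; compile against $(hadoop classpath) and run e.g.:
//   hadoop jar hdfs-md5.jar HdfsMd5Sum /tmp/issue
public class HdfsMd5Sum {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    MessageDigest md5 = MessageDigest.getInstance("MD5");

    // Stream the file serially from HDFS, feeding the digest 64 KB at a time.
    FSDataInputStream in = fs.open(new Path(args[0]));
    try {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) != -1) {
        md5.update(buf, 0, n);
      }
    } finally {
      in.close();
    }

    // Emit "<hex digest>  <name>", the same layout md5sum produces,
    // so the output can be checked later with "md5sum -c".
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex + "  " + args[0]);
  }
}

Because this reads the bytes serially through a single client, it fits Scott's serial-read constraint, and the digest it prints should be byte-for-byte comparable with md5sum run against a copyToLocal'd copy of the same file.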