The FileSystem API exposes a getFileChecksum() method too:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)

However, this isn't the straight-up MD5 of the file's contents: it's a composite checksum (an MD5 of per-block MD5s of CRC32s computed over 512-byte chunks, hence the "MD5-of-0MD5-of-512CRC32" label below), so it will never match what md5sum prints for the same bytes.
[1]monster01::groovy-1.7.8(12813)$CLASSPATH=$(hadoop classpath) bin/groovysh
groovy:000> import org.apache.hadoop.fs.FileSystem
===> [import org.apache.hadoop.fs.FileSystem]
groovy:000> import org.apache.hadoop.conf.Configuration
===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration]
groovy:000> import org.apache.hadoop.fs.Path
===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration, import org.apache.hadoop.fs.Path]
groovy:000> fs = FileSystem.get(new Configuration())
===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
groovy:000> fs
===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
groovy:000> fs.getFileChecksum(new Path("/tmp/issue"))
===> MD5-of-0MD5-of-512CRC32:eeec9870219b2381f99ac8ea0c2d0d60
groovy:000>

Whereas:

[0]monster01::~(12845)$hadoop fs -put /etc/issue /tmp/issue
[1]monster01::~(12844)$md5sum /etc/issue
6c9222ee501323045d85545853ebea55  /etc/issue

On Wed, Mar 2, 2011 at 9:47 AM, <stu24m...@yahoo.com> wrote:
> I don't think there is a built-in command. I would just use the java or
> thrift api to read the file & calculate the hash. (thrift + python/ruby/etc)
>
> Take care,
> -stu
>
> -----Original Message-----
> From: Scott Golby <sgo...@conductor.com>
> Date: Wed, 2 Mar 2011 11:05:04
> To: hdfs-user@hadoop.apache.org
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: md5sum of files on HDFS ?
>
> Hi Everyone,
>
> How can I do an md5sum/sha1sum directly against files on HDFS?
>
> A pretty common thing I do when archiving files is make an md5sum list,
> e.g.) md5sum /archive/path/* > md5sum-list.txt
>
> Then later, should I need to check the files are OK, perhaps before a
> restore or when I copy them somewhere else, I'll do
> md5sum -c md5sum-list.txt
>
> I'd be OK doing it one file at a time:
> java -jar <something> hdfs://some/path/in-hadoop/filename
>
> I'm also OK doing it serially through a single node. I've been doing some
> googling and JIRA ticket reading, such as
> https://issues.apache.org/jira/browse/HADOOP-3981, and for my use case
> serial read is not a limitation.
>
> What is a bit of a requirement is something I can run as a standard Linux
> command on local disk and do 1:1 output comparison,
> e.g.) check the HDFS md5sum of a file, copyToLocal, re-check the local-disk md5sum.
>
> Thanks,
> Scott
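Following stu's suggestion above of reading the file through the Java API and hashing it client-side, here is a minimal sketch that prints its result in md5sum's own format. The class name, jar name, and invocation are illustrative, not an existing tool:

import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative utility; compile against $(hadoop classpath) and run e.g.:
//   hadoop jar hdfs-md5.jar HdfsMd5Sum /tmp/issue
public class HdfsMd5Sum {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    MessageDigest md5 = MessageDigest.getInstance("MD5");

    // Stream the file serially from HDFS, feeding the digest 64 KB at a time.
    FSDataInputStream in = fs.open(new Path(args[0]));
    try {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) != -1) {
        md5.update(buf, 0, n);
      }
    } finally {
      in.close();
    }

    // Emit "<hex digest>  <name>", the same layout md5sum produces,
    // so the output can be checked later with "md5sum -c".
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex + "  " + args[0]);
  }
}

Because this reads the bytes serially through a single client, it fits Scott's serial-read constraint, and the digest it prints should be byte-for-byte comparable with md5sum run against a copyToLocal'd copy of the same file.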