RE:SequenceFile and streaming

walter steffe Thu, 28 May 2009 22:10:08 -0700

Hi Tom,

  I have seen the tar-to-seq tool, but the person who made it says it is
very slow:
"It took about an hour and a half to convert a 615MB tar.bz2 file to an
868MB sequence file". To me that is not acceptable.
Normally, generating a tar file from 615MB of data takes less than
one minute. And, in my view, generating a sequence file should be
even simpler: you just have to append files and headers without worrying
about hierarchy.
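
To make the "just append files and headers" point concrete, here is a
minimal Python sketch. It is only an illustration under my own
assumptions: the record layout (name length, name, data length, data) and
the function names append_record and pack_files are hypothetical, and this
is NOT the real SequenceFile format, which has its own header and sync
markers.

```python
import struct


def append_record(out, name, data):
    """Append one record to an open binary stream.

    Hypothetical layout (not the SequenceFile format):
    4-byte big-endian name length, UTF-8 name,
    8-byte big-endian data length, raw data.
    """
    encoded = name.encode("utf-8")
    out.write(struct.pack(">I", len(encoded)))
    out.write(encoded)
    out.write(struct.pack(">Q", len(data)))
    out.write(data)


def pack_files(paths, container_path):
    """Append every file in `paths` to a flat container in one pass."""
    with open(container_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # No directory hierarchy is tracked: the path is just a key.
                append_record(out, path, src.read())
```

Each input file is read once and written once, so the cost should be
dominated by plain I/O, which is why I would expect it to be no slower
than creating a tar file.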


Regarding the SequenceFileAsTextInputFormat I am not sure it will do the
job I am looking for.
The Hadoop documentation says: SequenceFileAsTextInputFormat generates
SequenceFileAsTextRecordReader which converts the input keys and values
to their String forms by calling toString() method.
Let us suppose that the keys and values were generated using tar-to-seq
on a tar archive. Each value is a byte array that stores the content of a
file, which can be any kind of data (for example, a JPEG picture). It
doesn't make sense to convert this data into a string.
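
A small Python sketch of the general problem with treating binary data as
text. It only shows that arbitrary bytes do not survive a text decoding;
it makes no claim about what toString() actually produces for a byte
array in Hadoop, which I have not checked. The function name and the
sample bytes are my own.

```python
def survives_text_round_trip(data):
    """True if decoding `data` as UTF-8 text and encoding it back is lossless."""
    # Bytes that are not valid UTF-8 are replaced by U+FFFD while decoding,
    # so the original values cannot be recovered afterwards.
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8") == data


# The first bytes of a typical JPEG file, used here as sample binary data.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

print(survives_text_round_trip(jpeg_header))    # binary payload
print(survives_text_round_trip(b"plain text"))  # ordinary ASCII text
```

The binary payload comes back altered, while the plain text comes back
unchanged.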

What is needed is a tool that simply extracts a file, as with
tar -xf archive.tar filename. With the Hadoop framework, a record can be
extracted only as a Java object, and you have to do that within a Java
program. The streaming package is meant to be used from a Unix shell
without the need for Java programming. But I think it is not very useful
if the SequenceFile (which is the principal data structure of Hadoop) is
not accessible from a shell command.
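
As a sketch of the kind of tool I mean, here is a Python function that
pulls one named entry out of a flat container, in the spirit of
tar -xf archive.tar filename. The container layout and the name extract
are hypothetical, chosen only for illustration; this does not read real
SequenceFiles.

```python
import struct


def extract(container, wanted):
    """Return the bytes stored under `wanted`, or None if it is absent.

    `container` is a seekable binary stream. Assumed (hypothetical) record
    layout: 4-byte big-endian name length, UTF-8 name,
    8-byte big-endian data length, raw data.
    """
    while True:
        header = container.read(4)
        if len(header) < 4:
            return None  # reached the end without finding the name
        (name_len,) = struct.unpack(">I", header)
        name = container.read(name_len).decode("utf-8")
        (data_len,) = struct.unpack(">Q", container.read(8))
        if name == wanted:
            return container.read(data_len)
        container.seek(data_len, 1)  # skip this record's payload
```

Wrapped in a small script, something like this could be called from a
shell and write the returned bytes to a file or to standard output.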


Walter
