Hello,
  I am a new user and I would like to use Hadoop streaming with
SequenceFiles on both the input and output sides.

-The first difficulty arises from the lack of a simple tool to generate
a SequenceFile starting from a set of files in a given directory.
I would like to have something similar to "tar -cvf file.tar foo/".
It should also work in the opposite direction, like "tar -xvf file.tar".
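As a rough, untested sketch of the packing direction using the existing
SequenceFile.Writer API (the class name, and the choice of file name as
Text key with the raw bytes as BytesWritable value, are just my
assumptions, not an existing tool):

  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // "tar -cvf" direction: pack every regular file under a local
  // directory (args[0]) into one SequenceFile (args[1]), keyed by
  // file name, with the raw bytes as the value.
  public class DirToSequenceFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
      for (File f : new File(args[0]).listFiles()) {
        if (!f.isFile()) continue;          // this sketch skips subdirectories
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          int off = 0, n;
          while (off < buf.length
                 && (n = in.read(buf, off, buf.length - off)) > 0) off += n;
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
      writer.close();
    }
  }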

-Another important feature that I would like to see is the ability
to feed the mapper's stdin with the whole content of a file (extracted
from the SequenceFile), disregarding the key.
Treating each value as a tar archive, I would like to be able to do:

 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
                  -input "/user/me/inputSequenceFile"  \
                  -output "/user/me/outputSequenceFile"  \
                  -inputformat SequenceFile \
                  -outputformat SequenceFile \
                  -mapper myscript.sh \
                  -reducer NONE

 myscript.sh should work as a filter which takes its input from
 stdin and puts its output on stdout:

  tar -xf -
  # do something on the extracted dir and create an outputfile
  cat outputfile

The output file should (automatically) go into the outputSequenceFile.

I think that this would be a very useful scheme which fits well with
the MapReduce requirements on one side and with the Unix commands on the
other side. It should not be too difficult to implement the tools
needed for that.
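
For completeness, an equally untested sketch of the unpacking
direction ("tar -xvf"), again assuming the file-name-key /
raw-bytes-value layout from the packing sketch above:

  import java.io.FileOutputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // "tar -xvf" direction: write each record of a SequenceFile
  // (args[0]) back out as a file under a local directory (args[1]).
  public class SequenceFileToDir {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path(args[0]), conf);
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {
        FileOutputStream out =
            new FileOutputStream(args[1] + "/" + key.toString());
        // only the first getLength() bytes of the backing array are valid
        out.write(value.getBytes(), 0, value.getLength());
        out.close();
      }
      reader.close();
    }
  }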


Walter






