Hello, I am a new user and I would like to use Hadoop streaming with SequenceFile on both the input and output sides.
- The first difficulty arises from the lack of a simple tool to generate a SequenceFile from a set of files in a given directory. I would like to have something similar to "tar -cvf file.tar foo/". This should also work in the opposite direction, like "tar -xvf file.tar". (A sketch of both directions follows at the end of this mail.)

- Another important feature I would like to see is the possibility to feed the mapper's stdin with the whole content of a file (extracted from the SequenceFile), disregarding the key. Using each file as a tar archive, I would like to be able to do:

  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input "/user/me/inputSequenceFile" \
      -output "/user/me/outputSequenceFile" \
      -inputformat SequenceFile \
      -outputformat SequenceFile \
      -mapper myscript.sh \
      -reducer NONE

myscript.sh should work as a filter which takes its input from stdin and puts its output on stdout:

  tar -x
  # do something on the generated dir and create an outputfile
  cat outputfile

The output file should (automatically) go into the outputSequenceFile.

I think this would be a very useful schema: it fits well with the MapReduce requirements on one side and with the Unix commands on the other. It should not be too difficult to implement the tools needed for that.

Walter
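P.S. To make the first request concrete, here is a rough, untested sketch of the packing tool in the "tar -cvf" direction, built on the stock SequenceFile.Writer API. The class name DirToSequenceFile and the key/value convention (file name as a Text key, raw bytes as a BytesWritable value) are just my assumptions, not anything Hadoop prescribes:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical tool: pack every regular file of a local directory into
// one SequenceFile (file name -> Text key, raw bytes -> BytesWritable value).
// Usage: DirToSequenceFile <localDir> <outputSequenceFile>
public class DirToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {
        if (!f.isFile()) continue;              // skip subdirectories etc.
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          int off = 0;                          // read the whole file
          while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) throw new IOException("unexpected EOF in " + f);
            off += n;
          }
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}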
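And the opposite direction ("tar -xvf"), again only a sketch under the same key/value assumption, this time using SequenceFile.Reader:

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical counterpart: unpack a SequenceFile written as above back
// into individual local files, one per key.
// Usage: SequenceFileToDir <inputSequenceFile> <localDir>
public class SequenceFileToDir {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {
        FileOutputStream out =
            new FileOutputStream(args[1] + "/" + key.toString());
        try {
          // getBytes() may return a padded buffer, so honor getLength().
          out.write(value.getBytes(), 0, value.getLength());
        } finally {
          out.close();
        }
      }
    } finally {
      reader.close();
    }
  }
}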