Hi Tom, I have seen the tar-to-seq tool, but the person who made it says it is very slow: "It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file". To me that is not acceptable. Normally, generating a tar file from 615MB of data takes less than one minute. And, in my view, generating a sequence file should be even simpler: you just have to append files and headers without worrying about hierarchy.
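To illustrate what "just append files and headers" could mean, here is a minimal Python sketch. It is only a toy: the record layout (two 4-byte lengths, then the name, then the raw bytes) and the names append_record, read_records and pack_files are made up for this example. It is not the real SequenceFile format and not how tar-to-seq works.

```python
import io
import struct


def append_record(out, name, data):
    """Append one record: 4-byte name length, 4-byte data length, name, data."""
    key = name.encode("utf-8")
    # Hypothetical header layout: two big-endian unsigned 32-bit lengths.
    out.write(struct.pack(">II", len(key), len(data)))
    out.write(key)
    out.write(data)


def read_records(inp):
    """Yield (name, data) pairs until the stream ends."""
    while True:
        header = inp.read(8)
        if len(header) < 8:
            return  # end of stream
        key_len, data_len = struct.unpack(">II", header)
        name = inp.read(key_len).decode("utf-8")
        yield name, inp.read(data_len)


def pack_files(files):
    """Pack an iterable of (name, bytes) pairs into one flat byte string."""
    buf = io.BytesIO()
    for name, data in files:
        append_record(buf, name, data)
    return buf.getvalue()
```

Each file is written once, in order, with no directory tree to maintain, which is why I would expect the cost to be dominated by plain I/O.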
Regarding SequenceFileAsTextInputFormat, I am not sure it will do the job I am looking for. The Hadoop documentation says: SequenceFileAsTextInputFormat generates SequenceFileAsTextRecordReader, which converts the input keys and values to their String forms by calling the toString() method. Let us suppose that the keys and values were generated by running tar-to-seq on a tar archive. Each value is then a byte array that stores the content of a file, which can be any kind of data (for example, a JPEG picture). It doesn't make sense to convert this data into a string. What is needed is a tool that simply extracts the file, as with tar -xf archive.tar filename. The Hadoop framework lets you extract the value as a Java object, but you have to do that within a Java program. The streaming package is meant to be used from a Unix shell without the need for Java programming. But I think it is not very useful if the SequenceFile (which is the principal data structure of Hadoop) is not accessible from a shell command.
Walter