Well, I don't know much about the tar tool at all. But bz2 is a VERY slow compression scheme (though quite fascinating to read about how it works). A plain tar or a tar.gz will be faster, if the tool supports it.
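If you want to check how much of that hour and a half is plausibly down to bz2 itself, something like the following rough sketch may help. It only times Python's standard `gzip` and `bz2` modules on made-up in-memory data (the `compare` and `time_roundtrip` helpers and the sample payload are my own invention, not part of tar-to-seq or Hadoop), so treat the numbers as indicative at best.

```python
import bz2
import gzip
import os
import time


def time_roundtrip(compress, decompress, payload):
    """Compress and decompress payload once; return (seconds, compressed size)."""
    start = time.perf_counter()
    packed = compress(payload)
    unpacked = decompress(packed)
    elapsed = time.perf_counter() - start
    assert unpacked == payload  # sanity check: the round trip is lossless
    return elapsed, len(packed)


def compare(payload):
    """Return {codec name: (seconds, compressed size)} for gzip and bz2."""
    return {
        "gzip": time_roundtrip(gzip.compress, gzip.decompress, payload),
        "bz2": time_roundtrip(bz2.compress, bz2.decompress, payload),
    }


if __name__ == "__main__":
    # Hypothetical sample data: repetitive text followed by random bytes.
    sample = (b"hadoop sequence file " * 20_000) + os.urandom(500_000)
    for codec, (seconds, size) in compare(sample).items():
        print(f"{codec}: {seconds:.3f}s, {size} bytes")
```

Results will vary with the data and the machine, and a Java tool reading a tar.bz2 may behave differently from these Python bindings.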
On 5/28/09 10:10 PM, "walter steffe" <ste...@tiscali.it> wrote:

> Hi Tom,
>
> I have seen the tar-to-seq tool, but the person who made it says it is
> very slow:
> "It took about an hour and a half to convert a 615MB tar.bz2 file to an
> 868MB sequence file". To me that is not acceptable.
> Normally, generating a tar file from 615MB of data takes less than
> one minute. And, in my view, the generation of a sequence file should be
> even simpler. You just have to append files and headers without worrying
> about hierarchy.
>
> Regarding the SequenceFileAsTextInputFormat, I am not sure it will do the
> job I am looking for.
> The hadoop documentation says: SequenceFileAsTextInputFormat generates
> SequenceFileAsTextRecordReader, which converts the input keys and values
> to their String forms by calling the toString() method.
> Let us suppose that the keys and values were generated using tar-to-seq
> on a tar archive. Each value is a byte array that stores the content of a
> file, which can be any kind of data (for example, a jpeg picture). It
> doesn't make sense to convert this data into a string.
>
> What is needed is a tool to simply extract the file, as with
> tar -xf archive.tar filename. The hadoop framework can be used to
> extract a Java class, and you have to do that within a java program. The
> streaming package is meant to be used in a unix shell without the need
> for java programming. But I think it is not very useful if the
> sequence file (which is the principal data structure of hadoop) is not
> accessible from a shell command.
>
> Walter
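
To make the "append files and headers" idea concrete, here is a minimal sketch of it in Python. Note the assumptions: the record layout (two big-endian 32-bit lengths, then the name, then the raw bytes) is something I made up for illustration and is NOT the Hadoop SequenceFile format, and `tar_to_records` and `extract` are hypothetical helper names, not part of tar-to-seq or Hadoop. It just shows flattening a tar into name/content records and pulling one file back out as raw bytes, with no string conversion.

```python
import io
import struct
import tarfile


def tar_to_records(tar_bytes):
    """Flatten a tar archive into name/content records, ignoring hierarchy.

    Illustrative layout only (not Hadoop's SequenceFile format):
    4-byte key length, 4-byte value length, key bytes, value bytes.
    """
    out = io.BytesIO()
    # Mode "r:*" lets tarfile detect plain, gzip, or bz2 archives by itself.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-files
            key = member.name.encode("utf-8")
            value = archive.extractfile(member).read()
            out.write(struct.pack(">II", len(key), len(value)))
            out.write(key)
            out.write(value)
    return out.getvalue()


def extract(records, wanted):
    """Return the raw bytes stored under the name `wanted`, or None."""
    offset = 0
    while offset < len(records):
        key_len, value_len = struct.unpack_from(">II", records, offset)
        offset += 8
        key = records[offset:offset + key_len].decode("utf-8")
        offset += key_len
        if key == wanted:
            return records[offset:offset + value_len]
        offset += value_len  # not the one we want: skip over its content
    return None
```

Calling `extract(records, "some/file.jpg")` plays the role of `tar -xf archive.tar filename` here: the value comes back as bytes, so a jpeg stays a jpeg. A real SequenceFile adds its own header, sync markers, and Writable serialization, which this sketch leaves out.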