On Feb 19, 2008, at 12:31 PM, Doug Cutting wrote:

Goel, Ankur wrote:
Hi All,
Is there an input format available for reading from tarballs
(.tar.gz files)?

Not at present. There is support for reading .gz files, but not .tar files. A problem is that there's no way to read a chunk of such an archive without reading everything preceding that chunk. So, if such an InputFormat were written, it would be unable to efficiently split the processing of an archive among map tasks.
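
To illustrate the splitting problem, here is a minimal Python sketch (not Hadoop code). It assumes a plain gzip stream built in memory, and the helper name can_decode_from is made up for this example. Decoding works from the first byte of the stream, but an attempt that starts at an arbitrary offset in the middle is rejected, which is roughly the situation a map task would face if it were handed only a later slice of a .gz file.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`.

    Hypothetical helper for illustration only.
    """
    try:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        zlib.decompress(blob[offset:], wbits=31)
        return True
    except zlib.error:
        # No gzip header at this position, so the decoder gives up.
        return False


# Build a compressible payload and gzip it in memory.
payload = b"".join(b"record %d\n" % i for i in range(10000))
blob = gzip.compress(payload)

print(can_decode_from(blob, 0))               # start of the stream
print(can_decode_from(blob, len(blob) // 2))  # arbitrary mid-stream offset
```

The sketch only shows that a reader cannot simply begin at a byte offset; it says nothing about block-oriented compression formats, which behave differently.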


Would it make sense to write a simple tool (maybe a Map-Reduce application) which, given a tar, will uncompress it and write it out as separate files? Folks can then run Map-Reduce applications on top of the uncompressed data...
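
A rough sketch of what the unpacking step of such a tool might look like, in Python using the standard tarfile module. This is only an illustration under stated assumptions: it writes to the local filesystem rather than to HDFS, it is not a Map-Reduce application, and the function name unpack_tar is invented for this example.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_tar(archive_path, out_dir):
    """Write each regular file in a (possibly compressed) tar as its own file.

    Hypothetical helper: local filesystem only, no HDFS involved.
    Returns the list of paths that were written.
    """
    out_root = Path(out_dir).resolve()
    written = []
    # "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, devices
            target = (out_root / member.name).resolve()
            if out_root not in target.parents:
                continue  # skip entries that would land outside out_dir
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Note that the archive is still read front to back by a single reader; the gain is that the resulting files can afterwards be processed independently.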

Blimey! This should be supported by distcp! *smile*

Arun

Doug
