On Feb 19, 2008, at 12:31 PM, Doug Cutting wrote:
Goel, Ankur wrote:
Hi All,
Is there an input format available for reading from tarballs
(.tar.gz files)?
Not at present. There is support for reading .gz files, but
not .tar files. A problem is that there's no way to read a
chunk of such archives without reading everything preceding that
chunk. So, if such an InputFormat were written, it would be unable
to efficiently split the processing of an archive among map tasks.
Would it make sense to write a simple tool (maybe a Map-Reduce
application) which, given a tar, will uncompress it and write it out as
separate files? Folks can then run Map-Reduce applications on top of
the uncompressed data...
Blimey! This should be supported by distcp! *smile*
Arun
Doug
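
For illustration only, here is a minimal sketch of the kind of unpacking tool suggested above. It is not an existing Hadoop utility and it is not distcp. It assumes the archive and the output directory are on a local filesystem, and the function name unpack_tarball is made up for this example. It uses Python's standard tarfile module in streaming mode, which also shows the point made earlier: the archive can only be read front to back.

```python
import os
import shutil
import sys
import tarfile


def unpack_tarball(tar_path, out_dir):
    """Write every regular file in a .tar.gz archive under out_dir.

    Hypothetical helper for illustration. Returns the paths written,
    in the order the members appear in the archive.
    """
    written = []
    out_root = os.path.realpath(out_dir)
    # "r|gz" opens the archive as a forward-only stream: members can only
    # be visited in order, so one archive cannot be split among readers.
    with tarfile.open(tar_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and device entries
            target = os.path.realpath(os.path.join(out_root, member.name))
            # Refuse entries that would be written outside out_dir.
            if os.path.commonpath([out_root, target]) != out_root:
                raise ValueError("unsafe path in archive: " + member.name)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            source = archive.extractfile(member)
            with open(target, "wb") as sink:
                shutil.copyfileobj(source, sink)
            written.append(target)
    return written


if __name__ == "__main__":
    # Usage: python unpack_tarball.py archive.tar.gz output_dir
    for path in unpack_tarball(sys.argv[1], sys.argv[2]):
        print(path)
```

Once the files are written out separately, they could be copied into the cluster and processed as ordinary, independently readable inputs. A Map-Reduce version would have to parallelize across archives rather than within one, since each archive still has to be read sequentially.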