There is a related issue and discussion at
https://issues.apache.org/jira/browse/MAPREDUCE-589.


On 7/16/10 1:04 AM, "David Pellegrini" <[email protected]> wrote:

Hi All,

I haven't seen this discussed in documentation or user forums, so I'm
hoping someone here can provide some guidance.  :-)

I created a M/R job using StreamXmlRecordReader to read XML input, and
it works fine when testing with uncompressed files.  However, the files
I have to process in production are gzip'ed, and when running them as
input, the mapper task was never invoked.  No splits were generated or
identified in the input.

Points:
   1. From "Hadoop: The Definitive Guide" -- "if your input files are
compressed, they will be automatically decompressed as they are read by
MapReduce, using the filename extension to determine the codec to use."
   2. gzip compression is not splittable
   3. StreamingInputFormat implements isSplitable() based on the codec.
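
The three points above can be sketched in plain Java, without Hadoop on the
classpath. This is only an illustration under stated assumptions: the class
GzipXmlSketch, its methods, and the <rec> tags are hypothetical names, the
extension check stands in for Hadoop's codec lookup, and the record scan is a
simplified stand-in for what StreamXmlRecordReader does, not its actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not Hadoop code: mimics extension-based codec
// selection and a begin/end-tag record scan over a gzip'ed XML stream.
public class GzipXmlSketch {

    // Points 2 and 3: a file is splittable only when no codec applies to it.
    // Assumption: ".gz" is the only compressed extension we recognise here.
    public static boolean isSplitable(String fileName) {
        return !fileName.endsWith(".gz");
    }

    // Point 1: pick the decompressor from the filename extension.
    public static InputStream open(String fileName, InputStream raw) {
        try {
            return fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Collect every begin...end span. For brevity the whole stream is read
    // into memory; a real record reader would scan incrementally.
    public static List<String> records(InputStream in, String begin, String end) {
        String text;
        try {
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(begin, from);
            if (start < 0) break;
            int stop = text.indexOf(end, start + begin.length());
            if (stop < 0) break;
            out.add(text.substring(start, stop + end.length()));
            from = stop + end.length();
        }
        return out;
    }

    // Test helper: gzip a string in memory so the example needs no files.
    public static byte[] gzip(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String name = "input.xml.gz"; // hypothetical file name
        byte[] data = gzip("<doc><rec>a</rec><rec>b</rec></doc>");
        System.out.println("splittable: " + isSplitable(name));
        InputStream in = open(name, new ByteArrayInputStream(data));
        for (String rec : records(in, "<rec>", "</rec>")) {
            System.out.println(rec);
        }
    }
}
```

Since a gzip stream can only be decoded from its first byte, a reader along
these lines has to treat each .gz file as a single split, so one mapper would
see the whole file.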

In the spirit of "I can't believe I'm the first person to attempt
processing gzip'ed XML," who has done this and can share the secrets of
their success?  Or have all attempts at this failed, so I should stop
now and try another approach entirely?

Thanks!

David
