There is a related issue and discussion at https://issues.apache.org/jira/browse/MAPREDUCE-589.

On 7/16/10 1:04 AM, "David Pellegrini" <[email protected]> wrote:

Hi All,

I haven't seen this discussed in documentation or user forums, so I'm hoping someone here can provide some guidance. :-)

I created a M/R job using StreamXmlRecordReader to read XML input, and it works fine when testing with uncompressed files. However, the files I have to process in production are gzip'ed, and when running them as input, the mapper task was never invoked. No splits were generated or identified in the input.

Points:
1. From "Hadoop: The Definitive Guide" -- "if your input files are compressed, they will be automatically decompressed as they are read by MapReduce, using the filename extension to determine the codec to use."
2. gzip compression is not splittable
3. StreamingInputFormat implements isSplittable() based on the codec.

In the spirit of "I can't believe I'm the first person to attempt processing gzip'ed XML," who has done this and can share the secrets of their success? Or have all attempts at this failed, so I should stop now and try another approach entirely?

Thanks!
David
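
For anyone following the three points above, here is a rough standalone Python sketch of the behavior they describe: pick a decompressor from the filename extension, treat compressed files as non-splittable, and scan the decompressed stream for records between a begin and an end marker. This is not Hadoop code and not how StreamXmlRecordReader is implemented. The names `OPENERS`, `open_by_extension`, `is_splittable`, and `xml_records` are hypothetical, made up for illustration only.

```python
import gzip
import os

# Hypothetical extension-to-opener table (an assumption for this sketch;
# Hadoop's real codec lookup is more general than this).
OPENERS = {".gz": gzip.open}


def open_by_extension(path):
    """Open a file for binary reading, decompressing if the extension is known."""
    ext = os.path.splitext(path)[1]
    opener = OPENERS.get(ext, open)
    return opener(path, "rb")


def is_splittable(path):
    """In this sketch, any file with a known compression extension is one split."""
    return os.path.splitext(path)[1] not in OPENERS


def xml_records(stream, begin, end, chunk_size=64 * 1024):
    """Yield each byte string from a `begin` marker through the next `end` marker."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # an unterminated trailing record is dropped
        buf += chunk
        while True:
            start = buf.find(begin)
            if start < 0:
                # Keep only a tail that could hold a partial begin marker.
                buf = buf[-(len(begin) - 1):] if len(begin) > 1 else b""
                break
            stop = buf.find(end, start + len(begin))
            if stop < 0:
                # Record is incomplete; keep it and wait for more data.
                buf = buf[start:]
                break
            stop += len(end)
            yield buf[start:stop]
            buf = buf[stop:]
```

Usage would look something like `with open_by_extension("input.xml.gz") as f: records = list(xml_records(f, b"<rec>", b"</rec>"))`, where the file name and tag names are placeholders. Because the whole gzip stream has to be read from the start, a single reader handles the entire file, which is the practical meaning of point 2.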
