Hi all,
I'm writing some Hadoop jobs that should run on a collection of gzipped files. Everything already works correctly with MultiFileInputFormat and a preliminary gunzip extraction step. Since Hadoop recognizes and correctly handles .gz files (at least with a single-file input), I was wondering whether it can do the same with file collections, so that I can avoid the overhead of extracting the files sequentially. I tried running the multi-file WordCount example on a bunch of gzipped text files (0.17.1 installation), and I get wrong output (neither correct nor empty). With my own InputFormat (not much different from the one in multifilewc), I get no output at all (map input record counter = 0).
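To make the question concrete: my understanding is that with MultiFileSplit the record reader has to open (and decompress) each file itself, so I'd expect to need something along these lines. This is just an untested sketch against the 0.17 mapred API; the class name GzipAwareMultiFileInputFormat and the line-based reading are purely illustrative:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MultiFileInputFormat;
    import org.apache.hadoop.mapred.MultiFileSplit;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative sketch: a MultiFileInputFormat whose record reader
    // decompresses each file in the split when its extension maps to a
    // known codec (e.g. ".gz" -> GzipCodec).
    public class GzipAwareMultiFileInputFormat
        extends MultiFileInputFormat<LongWritable, Text> {

      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new GzipAwareLineReader((MultiFileSplit) split, job);
      }

      static class GzipAwareLineReader
          implements RecordReader<LongWritable, Text> {

        private final MultiFileSplit split;
        private final Configuration conf;
        private final CompressionCodecFactory codecs;
        private int fileIndex = 0;
        private BufferedReader reader;
        private long pos = 0;

        GzipAwareLineReader(MultiFileSplit split, Configuration conf)
            throws IOException {
          this.split = split;
          this.conf = conf;
          this.codecs = new CompressionCodecFactory(conf);
          openNextFile();
        }

        // Open the next file in the split, wrapping the raw stream with
        // the matching codec's decompressing stream if one is registered.
        private boolean openNextFile() throws IOException {
          if (reader != null) {
            reader.close();
            reader = null;
          }
          if (fileIndex >= split.getNumPaths()) {
            return false;
          }
          Path path = split.getPath(fileIndex++);
          FileSystem fs = path.getFileSystem(conf);
          CompressionCodec codec = codecs.getCodec(path);
          if (codec != null) {
            reader = new BufferedReader(
                new InputStreamReader(codec.createInputStream(fs.open(path))));
          } else {
            reader = new BufferedReader(new InputStreamReader(fs.open(path)));
          }
          return true;
        }

        public boolean next(LongWritable key, Text value) throws IOException {
          while (reader != null) {
            String line = reader.readLine();
            if (line != null) {
              key.set(pos++);
              value.set(line);
              return true;
            }
            if (!openNextFile()) {
              return false;
            }
          }
          return false;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return pos; }
        public float getProgress() {
          return split.getNumPaths() == 0
              ? 1.0f : (float) fileIndex / split.getNumPaths();
        }
        public void close() throws IOException {
          if (reader != null) { reader.close(); }
        }
      }
    }

(If that is indeed what's required, the same approach should presumably work for any codec the CompressionCodecFactory knows about, not just gzip.)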
Is this the desired behavior? Are there technical reasons why it doesn't work in a multi-file scenario? Thanks in advance for the help.

Regards,
Michele Catasta
