I have followed
http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the
requisite hadoop-lzo jar and native .so files.  (The jar and .so files were
built from Kevin Weil's git repository.  Thanks, Kevin.)  I have configured
core-site.xml and mapred-site.xml as instructed to enable lzo compression
for both map and reduce output.  Creating the lzo index also worked.  The
last step was to replace TextInputFormat with LzoTextInputFormat.  As I only
have

    FileInputFormat.addInputPath(jobConf, new Path(inputPath));

it was replaced with

    LzoTextInputFormat.addInputPath(job, new Path(inputPath));
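
For reference, the core-site.xml and mapred-site.xml settings mentioned above
are roughly the following (property names as given in the
hadoop-gpl-compression FAQ; the exact codec list varies by install):

```xml
<!-- core-site.xml: register the lzo codecs (roughly, per the FAQ) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- mapred-site.xml: lzo-compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```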

When I ran my MR job, the new code read the .lzo input files and
decompressed them fine.  The output was also lzo compressed.  However, only
one map task was created per input .lzo file, indicating that input
splitting was not done by LzoTextInputFormat but more likely by a parent
class such as FileInputFormat.  There must be a way to ensure
LzoTextInputFormat is used for the map tasks.  How can this be done?
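
One detail worth noting: addInputPath is a static method that
LzoTextInputFormat inherits from FileInputFormat, so the call above only
registers the input path; it is not, by itself, what tells the job which
InputFormat to instantiate.  A minimal sketch of the driver in question
(class and path names are hypothetical, mapper/reducer settings elided):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lzo-job");  // hypothetical job name
        job.setJarByClass(LzoJobDriver.class);

        // Inherited static FileInputFormat method: registers the path only.
        // The job's input format class is a separate, independent setting.
        LzoTextInputFormat.addInputPath(job, new Path(args[0]));

        // ... mapper, reducer, and output settings elided ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```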

Thanks in advance.
