Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files

Kevin Weil Sat, 23 Jan 2010 11:07:59 -0800

Ted, this might be tough -- the underlying LZO compression algorithm creates
the block offsets.  You can specify the LZO block size, but I don't think
it's exact enough for what you're looking for.
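
One possible workaround, sketched below, is to leave the block offsets alone
and handle the misalignment in the reader instead, the same way line-oriented
readers treat a split that starts mid-line: skip forward to the first
household boundary after the split start, and read past the split end to
finish the last household. This is only an illustration on plain decompressed
text. The class and method names (HouseholdSplitReader, readSplit) are
hypothetical and not part of hadoop-lzo, and it assumes households are
separated by exactly one empty line and never begin with a newline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of hadoop-lzo: reads whole "households"
// (records separated by one empty line) from an arbitrary [start, end) range.
final class HouseholdSplitReader {
    // Assumption: exactly one empty line between households.
    static final String SEP = "\n\n";

    // A split owns every household whose first character sits at a position p
    // with start < p <= end, plus the household at position 0 when start == 0.
    static List<String> readSplit(String data, int start, int end) {
        List<String> households = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip the household that contains (or begins at) `start`;
            // the previous split is responsible for it.
            int sep = data.indexOf(SEP, start - 1);
            if (sep < 0) {
                return households; // no household begins after `start`
            }
            pos = sep + SEP.length();
        }
        while (pos <= end && pos < data.length()) {
            int next = data.indexOf(SEP, pos);
            int stop = (next < 0) ? data.length() : next;
            // May run past `end` so the last household is read in full.
            households.add(data.substring(pos, stop));
            pos = (next < 0) ? data.length() : next + SEP.length();
        }
        return households;
    }
}
```

With adjacent ranges that cover the whole input, each household should come
back from exactly one readSplit call, wherever the range boundaries fall.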


Kevin


On Mon, Jan 18, 2010 at 11:12 AM, Ted Yu <[email protected]> wrote:

> For our custom text-based file format, we use an empty line to separate the
> data for different households.
> Can we make each LZO block start on a new household, possibly by
> modifying LzoIndexRecordWriter?
>
> Thanks
>
> On Thu, Dec 31, 2009 at 3:44 PM, Kevin Weil <[email protected]> wrote:
>
> > Steve, glad you got it figured out.  Interested to hear how it goes, and of
> > course feel free to post bugs/requests to the github page
> > www.github.com/kevinweil/hadoop-lzo.
> >
> > Kevin
> >
> > On Thu, Dec 31, 2009 at 12:21 PM, Steve Kuo <[email protected]> wrote:
> >
> > > Digging around the new Job api with a rested brain came up with
> > >
> > >             job.setInputFormatClass(LzoTextInputFormat.class);
> > >
> > > that solved the problem.
> > >
> > > On Thu, Dec 31, 2009 at 9:53 AM, Steve Kuo <[email protected]> wrote:
> > >
> > > > I have followed
> > > > http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
> > > > and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the
> > > > requisite hadoop-lzo jar and native .so files.  (The jar and .so files were
> > > > built from Kevin Weil's git repository.  Thanks Kevin.)  I have configured
> > > > core-site.xml and mapred-site.xml as instructed to enable lzo for both map
> > > > and reduce output.  The creation of lzo index also worked.  The last step was
> > > > to replace TextInputFormat with LzoTextInputFormat.  As I only have
> > > >
> > > >     FileInputFormat.addInputPath(jobConf, new Path(inputPath));
> > > >
> > > > it was replaced with
> > > >
> > > >      LzoTextInputFormat.addInputPath(job, new Path(inputPath));
> > > >
> > > > When I ran my MR job, I noticed that the new code was able to read in the
> > > > .lzo input files and decompress them fine.  The output was also lzo
> > > > compressed.  However, only one map task was created for each input .lzo
> > > > file, indicating that input splitting was not done by LzoTextInputFormat
> > > > but more likely by its parent, such as FileInputFormat.  There must be a
> > > > way to ensure LzoTextInputFormat is used in the Map task.  How can this be
> > > > done?
> > > >
> > > > Thanks in advance.
> > > >
> > > >
> > >
> >
>
