Ted, this might be tough -- the underlying LZO compression algorithm creates the block offsets. You can specify the LZO block size, but I don't think it's exact enough for what you're looking for.
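To make the alignment problem concrete, here is a minimal sketch. It is not hadoop-lzo code: it assumes a simplified model in which the compressor cuts the uncompressed stream into fixed-size blocks, and the class and method names (BlockAlignmentSketch, blockStarts, householdStarts) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models a compressor that cuts its input into fixed-size
// blocks of uncompressed bytes, ignoring the record structure of the text.
class BlockAlignmentSketch {

    // Offsets (in uncompressed bytes) at which each block starts.
    static List<Integer> blockStarts(int totalBytes, int blockSize) {
        List<Integer> starts = new ArrayList<>();
        for (int off = 0; off < totalBytes; off += blockSize) {
            starts.add(off);
        }
        return starts;
    }

    // Offsets at which a household starts: offset 0, plus the byte that
    // follows every empty line (two consecutive '\n' bytes).
    static List<Integer> householdStarts(byte[] data) {
        List<Integer> starts = new ArrayList<>();
        if (data.length > 0) {
            starts.add(0);
        }
        for (int i = 1; i < data.length; i++) {
            if (data[i] == '\n' && data[i - 1] == '\n' && i + 1 < data.length) {
                starts.add(i + 1);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three households separated by empty lines (20 bytes in total).
        byte[] data = "a1\na2\n\nb1\n\nc1\nc2\nc3\n"
                .getBytes(StandardCharsets.US_ASCII);

        List<Integer> households = householdStarts(data);
        List<Integer> blocks = blockStarts(data.length, 8); // 8-byte blocks, assumed

        System.out.println("household starts: " + households);
        System.out.println("block starts:     " + blocks);

        // Keep only the block starts that are also household starts.
        List<Integer> aligned = new ArrayList<>(blocks);
        aligned.retainAll(households);
        System.out.println("aligned starts:   " + aligned);
    }
}
```

For this 20-byte sample, households start at offsets 0, 7 and 11 while 8-byte blocks start at 0, 8 and 16, so only offset 0 coincides. Changing the block size moves the block starts, but it cannot follow households of varying length.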
Kevin

On Mon, Jan 18, 2010 at 11:12 AM, Ted Yu <[email protected]> wrote:

> For our custom text-based file format, we use empty line to mark data for
> different households.
> Can we make LZO block start to be aligned with new household, possibly by
> modifying LzoIndexRecordWriter ?
>
> Thanks
>
> On Thu, Dec 31, 2009 at 3:44 PM, Kevin Weil <[email protected]> wrote:
>
> > Steve, glad you got it figured out. Interested to hear how it goes, and of
> > course feel free to post bugs/requests to the github page
> > www.github.com/kevinweil/hadoop-lzo.
> >
> > Kevin
> >
> > On Thu, Dec 31, 2009 at 12:21 PM, Steve Kuo <[email protected]> wrote:
> >
> > > Digging around the new Job api with a rested brain came up with
> > >
> > > job.setInputFormatClass(LzoTextInputFormat.class);
> > >
> > > that solved the problem.
> > >
> > > On Thu, Dec 31, 2009 at 9:53 AM, Steve Kuo <[email protected]> wrote:
> > >
> > > > I have followed
> > > > http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
> > > > and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the
> > > > requisite hadoop-lzo jar and native .so files. (The jar and .so files were
> > > > built from Kevin Weil's git repository. Thanks Kevin.) I have configured
> > > > core-site.xml and mapred-site.xml as instructed to enable lzo for both map
> > > > and reduce output. The creation of lzo index also worked. The last step was
> > > > to replace TextInputFormat with LzoTextInputFormat. As I only have
> > > >
> > > > FileInputFormat.addInputPath(jobConf, new Path(inputPath));
> > > >
> > > > it was replaced with
> > > >
> > > > LzoTextInputFormat.addInputPath(job, new Path(inputPath));
> > > >
> > > > When I ran my MR job, I noticed that the new code was able to read in .lzo
> > > > input files and decompressed fine. The output was also lzo compressed.
> > > > However, only one map job was created for each input .lzo file indicating
> > > > that input splitting was not done by LzoTextInputFormat but more likely by
> > > > its parent such as FileInputFormat. There must be a way to ensure
> > > > LzoTextInputFormat is used in the Map task. How can this be done?
> > > >
> > > > Thanks in advance.
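
A note on why the first attempt in the original message had no effect: addInputPath is a static helper inherited from FileInputFormat, so calling it through the LzoTextInputFormat name only records the input path and does not select the input format; that is what job.setInputFormatClass does. The sketch below shows the same Java behaviour with stand-in classes. FakeJob, BaseFileFormat, SplittableFormat and the "input.dir" key are hypothetical names for illustration, not the Hadoop or hadoop-lzo classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the Hadoop classes.
class StaticInheritanceSketch {

    // Stand-in for a job: a key/value config plus the chosen input format.
    static class FakeJob {
        final Map<String, String> conf = new HashMap<>();
        // Default used when no format is set explicitly (assumed for this sketch).
        String inputFormat = "BaseTextFormat";

        void setInputFormatClass(Class<?> cls) {
            inputFormat = cls.getSimpleName();
        }
    }

    // Stand-in for a base input format with a static path helper.
    static class BaseFileFormat {
        static void addInputPath(FakeJob job, String path) {
            // Records the path only; a static method cannot see which
            // subclass name was written at the call site.
            job.conf.put("input.dir", path);
        }
    }

    // Stand-in for a splittable subclass; it inherits addInputPath unchanged.
    static class SplittableFormat extends BaseFileFormat { }

    public static void main(String[] args) {
        FakeJob job = new FakeJob();

        // Resolves to BaseFileFormat.addInputPath: the format is still the default.
        SplittableFormat.addInputPath(job, "/data/in");
        System.out.println("after addInputPath:        " + job.inputFormat);

        // Selecting the format has to be done explicitly.
        job.setInputFormatClass(SplittableFormat.class);
        System.out.println("after setInputFormatClass: " + job.inputFormat);
    }
}
```

In this sketch the format stays at its default after the addInputPath call and changes only once setInputFormatClass is called, which matches the behaviour described in the thread.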
