On May 6, 2009, at 8:22 AM, Tom White wrote:
Hi Rajarshi, FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code, however, since each split will read every record that starts within it (even a record that crosses the split boundary). This is just like how TextInputFormat works. So you don't need to use MultiFileInputFormat - it should work as-is. You could demonstrate this to yourself by writing a multi-block file and doing an identity MapReduce on it. You should find that no records are lost.
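This is not the actual Hadoop LineRecordReader code, but a minimal self-contained sketch of the rule Tom describes, assuming newline-delimited records: a split that doesn't begin at byte 0 backs up one byte and discards everything up to the next delimiter (so a record straddling the boundary is left to the previous split), and a split keeps reading a record that starts inside it even when that record runs past the split's end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Read every record that STARTS within [start, start+length).
    static List<String> readSplit(byte[] data, int start, int length) {
        List<String> out = new ArrayList<>();
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard up to (and including) the next
            // newline. If `start` happened to land exactly on a record
            // boundary, the discarded bytes are just the preceding newline,
            // so no record is lost.
            pos = start - 1;
            while (pos < data.length && data[pos++] != '\n') { }
        }
        // Keep reading whole records while the record *starts* inside the
        // split, even when the record itself runs past the split's end.
        while (pos < data.length && pos < end) {
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            out.add(new String(data, recStart, pos - recStart, StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // Split the "file" at byte 5, i.e. in the middle of "bbbb".
        List<String> first = readSplit(data, 0, 5);
        List<String> second = readSplit(data, 5, data.length - 5);
        System.out.println(first);   // [aa, bbbb] -- reads past its own end
        System.out.println(second);  // [cc]       -- skips the partial record
    }
}
```

Between them the two splits recover every record exactly once, which is why the identity MapReduce check comes out clean.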
Thanks for the description - once I realized that FileSplit.getStart() and getLength() give me the file offsets, I was able to modify my RecordReader subclass to deal with chunks starting and/or ending in the middle of a record. (For my own understanding I wrote it up at http://blog.rguha.net/?p=310 - maybe it'll be useful for other newbies.)
You might be able to use org.apache.hadoop.streaming.StreamXmlRecordReader (and StreamInputFormat), which does something similar. Despite its name it is not only for Streaming applications, and it isn't restricted to XML. It can parse records that begin with a certain sequence of characters, and end with another sequence.
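For Streaming jobs the record reader is typically selected with the -inputreader option; roughly along these lines, where the input path, output path, and the begin/end marker strings are placeholders to be replaced with whatever delimits your own records (the jar location also varies by Hadoop version):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -inputreader "StreamXmlRecordReader,begin=<page>,end=</page>" \
  -input /user/me/input \
  -output /user/me/output \
  -mapper /bin/cat \
  -reducer /bin/wc
```

Non-Streaming jobs can set StreamInputFormat as the job's input format and supply the same begin/end markers through the job configuration instead.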
I did indeed see this, after I wrote my own record reader :)

-------------------------------------------------------------------
Rajarshi Guha <rg...@indiana.edu>
GPG Fingerprint: D070 5427 CC5B 7938 929C DD13 66A1 922C 51E7 9E84
-------------------------------------------------------------------
Q: What's polite and works for the phone company?
A: A deferential operator.