Re: multi-line records and file splits

Tom White Wed, 06 May 2009 05:22:39 -0700

Hi Rajarshi,

FileInputFormat (SDFInputFormat's superclass) will break files into
splits, typically on HDFS block boundaries (if the defaults are left
unchanged). This is not a problem for your code, however, since it will
read every record that starts within a split, even if the record
crosses the split boundary. This is just how TextInputFormat works: a
line that straddles a boundary is read in full by the split it starts
in, and the reader for the next split skips the partial line. So you
don't need to use MultiFileInputFormat; it should work as is. You
could demonstrate this to yourself by writing a multi-block file and
running an identity MapReduce job on it. You should find that no
records are lost.
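
As a rough illustration of the "a record belongs to the split it starts
in" rule, here is a minimal Python sketch. It is not SDFRecordReader's or
Hadoop's actual code: the function names are made up for the example, the
whole file is held in memory as bytes, and it assumes "$$$$" only ever
appears as the end-of-record line.

```python
DELIM = b"$$$$\n"  # end-of-record marker line, as in the format quoted below


def read_split(data, start, end):
    """Return the records that *start* inside the byte range [start, end).

    Hypothetical helper for illustration only; `data` is the whole file.
    """
    if start == 0:
        pos = 0
    else:
        # Back up by one delimiter length so that a marker ending exactly
        # at `start` is noticed, then move just past the first marker found.
        # That position is the first record start at or after `start`.
        idx = data.find(DELIM, max(0, start - len(DELIM)))
        if idx == -1:
            return []  # no record starts in this split
        pos = idx + len(DELIM)

    records = []
    while pos < end and pos < len(data):
        # The record starts inside the split, so read it in full, even if
        # its end-of-record marker lies beyond `end`.
        idx = data.find(DELIM, pos)
        if idx == -1:
            break  # trailing text with no marker before EOF is ignored here
        records.append(data[pos:idx])
        pos = idx + len(DELIM)
    return records


def read_all_splits(data, split_size):
    """Cut `data` into fixed-size splits and read each one independently."""
    records = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        records.extend(read_split(data, start, end))
    return records
```

Running read_all_splits over the same bytes with different split sizes
should give the same list of records every time, which is the in-memory
equivalent of the identity MapReduce check suggested above.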


You might be able to use
org.apache.hadoop.streaming.StreamXmlRecordReader (and
StreamInputFormat), which does something similar. Despite its name it
is not only for Streaming applications, and it isn't restricted to
XML. It can parse records that begin with a certain sequence of
characters, and end with another sequence.
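
The begin/end idea can be sketched in a few lines of Python. This only
shows the general technique of scanning for a begin marker and then the
next end marker; it is not StreamXmlRecordReader's implementation, and
the function name and the "<rec>" markers are made up for the example.

```python
def extract_records(text, begin, end):
    """Yield each span of `text` that starts with `begin` and ends with `end`.

    Hypothetical helper for illustration; anything between records is skipped.
    """
    pos = 0
    while True:
        b = text.find(begin, pos)
        if b == -1:
            return  # no further record starts
        e = text.find(end, b + len(begin))
        if e == -1:
            return  # begin marker without a matching end: stop
        yield text[b:e + len(end)]
        pos = e + len(end)


records = list(extract_records("junk<rec>a</rec>more<rec>b</rec>", "<rec>", "</rec>"))
# records == ["<rec>a</rec>", "<rec>b</rec>"]
```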

Cheers,
Tom

On Wed, May 6, 2009 at 2:06 AM, Nick Cen <cenyo...@gmail.com> wrote:
> I think your SDFInputFormat should extend MultiFileInputFormat, which
> will not split the file into chunks, instead of TextInputFormat.
>
> 2009/5/6 Rajarshi Guha <rg...@indiana.edu>
>
>> Hi, I have implemented a subclass of RecordReader to handle a plain text
>> file format where a record is multi-line and of variable length.
>> Schematically each record is of the form
>>
>> some_title
>> foo
>> bar
>> $$$$
>> another_title
>> foo
>> foo
>> bar
>> $$$$
>>
>> where $$$$ is the marker for the end of the record. My code is at
>> http://blog.rguha.net/?p=293 and it seems to work fine on my input data.
>>
>> However, I realized that when I run the program, Hadoop will 'chunk' the
>> input file. As a result, the SDFRecordReader might get a chunk of input
>> text, such that the last record is actually incomplete (a missing $$$$). Is
>> this correct?
>>
>> If so, how would the RecordReader implementation recover from this
>> situation? Or is there a way to indicate to Hadoop that the input file
>> should be chunked keeping in mind end of record delimiters?
>>
>> Thanks
>>
>> -------------------------------------------------------------------
>> Rajarshi Guha  <rg...@indiana.edu>
>> GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
>> -------------------------------------------------------------------
>> Q:  What's polite and works for the phone company?
>> A:  A deferential operator.
>>
>>
>>
>
>
> --
> http://daily.appspot.com/food/
>
