Hi Rajarshi,

FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code, however, since it will read every record that starts within a split, even if that record crosses a split boundary. This is just like how TextInputFormat works. So you don't need to use MultiFileInputFormat - it should work as is. You could demonstrate this to yourself by writing a multi-block file and running an identity MapReduce job on it. You should find that no records are lost.
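
To make the rule "a reader handles every record that starts within its split" concrete, here is a minimal Python sketch. It is a simulation over an in-memory byte string, not Hadoop's API and not the code from the blog post; the names `read_split` and `split_ranges` are hypothetical, and it assumes every record ends with the byte sequence `$$$$` followed by a newline.

```python
DELIM = b"$$$$\n"  # assumed end-of-record marker, including its newline


def read_split(data: bytes, start: int, end: int) -> list:
    """Return the records whose first byte lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Skip ahead to the first record that begins at or after `start`.
        # Backing up by len(DELIM) catches a delimiter ending exactly at `start`.
        idx = data.find(DELIM, max(0, start - len(DELIM)))
        if idx == -1:
            return []
        pos = idx + len(DELIM)

    records = []
    while pos < end:
        idx = data.find(DELIM, pos)
        if idx == -1:
            break  # trailing text with no end marker is ignored in this sketch
        stop = idx + len(DELIM)  # may lie beyond `end`: finish the record anyway
        records.append(data[pos:stop])
        pos = stop
    return records


def split_ranges(length: int, split_size: int) -> list:
    """Cut [0, length) into consecutive fixed-size (start, end) ranges."""
    return [(s, min(s + split_size, length)) for s in range(0, length, split_size)]


data = (
    b"some_title\nfoo\nbar\n$$$$\n"
    b"another_title\nfoo\nfoo\nbar\n$$$$\n"
)
whole = read_split(data, 0, len(data))

# Whatever the split size, the per-split readers together return each record once.
for size in (7, 16, 64):
    pieces = [r for s, e in split_ranges(len(data), size) for r in read_split(data, s, e)]
    assert pieces == whole
```

A reader whose split begins mid-record skips forward to the next record start, and the reader for the preceding split reads past its own end to finish that record, so no record is lost or duplicated.
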
You might be able to use org.apache.hadoop.streaming.StreamXmlRecordReader (and StreamInputFormat), which does something similar. Despite its name it is not only for Streaming applications, and it isn't restricted to XML. It can parse records that begin with a certain sequence of characters and end with another sequence.

Cheers,
Tom

On Wed, May 6, 2009 at 2:06 AM, Nick Cen <cenyo...@gmail.com> wrote:
> I think your SDFInputFormat should implement the MultiFileInputFormat
> instead of the TextInputFormat, which will not split the file into chunks.
>
> 2009/5/6 Rajarshi Guha <rg...@indiana.edu>
>
>> Hi, I have implemented a subclass of RecordReader to handle a plain text
>> file format where a record is multi-line and of variable length.
>> Schematically each record is of the form
>>
>> some_title
>> foo
>> bar
>> $$$$
>> another_title
>> foo
>> foo
>> bar
>> $$$$
>>
>> where $$$$ is the marker for the end of the record. My code is at
>> http://blog.rguha.net/?p=293 and it seems to work fine on my input data.
>>
>> However, I realized that when I run the program, Hadoop will 'chunk' the
>> input file. As a result, the SDFRecordReader might get a chunk of input
>> text, such that the last record is actually incomplete (a missing $$$$). Is
>> this correct?
>>
>> If so, how would the RecordReader implementation recover from this
>> situation? Or is there a way to indicate to Hadoop that the input file
>> should be chunked keeping in mind end of record delimiters?
>>
>> Thanks
>>
>> -------------------------------------------------------------------
>> Rajarshi Guha <rg...@indiana.edu>
>> GPG Fingerprint: D070 5427 CC5B 7938 929C DD13 66A1 922C 51E7 9E84
>> -------------------------------------------------------------------
>> Q: What's polite and works for the phone company?
>> A: A deferential operator.
>
> --
> http://daily.appspot.com/food/