On May 6, 2009, at 8:22 AM, Tom White wrote:
Hi Rajarshi, FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code, however, since each split will read every record that starts within it (even a record that crosses the split boundary). This is just like how TextInputFormat works. So you don't need to use MultiFileInputFormat - it should work as-is. You could demonstrate this to yourself by writing a multi-block file and doing an identity MapReduce on it. You should find that no records are lost.
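This is not the actual Hadoop LineRecordReader code, but a minimal self-contained sketch of the rule Tom describes, assuming newline-delimited records: a split that doesn't begin at byte 0 backs up one byte and discards everything up to the next delimiter (so a record straddling the boundary is left to the previous split), and a split keeps reading a record that starts inside it even when that record runs past the split's end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Read every record that STARTS within [start, start+length).
    static List<String> readSplit(byte[] data, int start, int length) {
        List<String> out = new ArrayList<>();
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard up to (and including) the next
            // newline. If `start` happened to land exactly on a record
            // boundary, the discarded bytes are just the preceding newline,
            // so no record is lost.
            pos = start - 1;
            while (pos < data.length && data[pos++] != '\n') { }
        }
        // Keep reading whole records while the record *starts* inside the
        // split, even when the record itself runs past the split's end.
        while (pos < data.length && pos < end) {
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            out.add(new String(data, recStart, pos - recStart, StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // Split the "file" at byte 5, i.e. in the middle of "bbbb".
        List<String> first = readSplit(data, 0, 5);
        List<String> second = readSplit(data, 5, data.length - 5);
        System.out.println(first);   // [aa, bbbb] -- reads past its own end
        System.out.println(second);  // [cc]       -- skips the partial record
    }
}
```

Between them the two splits recover every record exactly once, which is why the identity MapReduce check comes out clean.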
Thanks for the description - once I realized that FileSplit.getStart() and getLength() give me the file offsets, I was able to modify my RecordReader subclass to deal with chunks starting and/or ending in the middle of a record. (For my own understanding I wrote it up at http://blog.rguha.net/?p=310 - maybe it'll be useful for other newbies.)
You might be able to use org.apache.hadoop.streaming.StreamXmlRecordReader (and StreamInputFormat), which does something similar. Despite its name it is not only for Streaming applications, and it isn't restricted to XML. It can parse records that begin with a certain sequence of characters, and end with another sequence.
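For Streaming jobs the record reader is typically selected with the -inputreader option; roughly along these lines, where the input path, output path, and the begin/end marker strings are placeholders to be replaced with whatever delimits your own records (the jar location also varies by Hadoop version):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -inputreader "StreamXmlRecordReader,begin=<page>,end=</page>" \
  -input /user/me/input \
  -output /user/me/output \
  -mapper /bin/cat \
  -reducer /bin/wc
```

Non-Streaming jobs can set StreamInputFormat as the job's input format and supply the same begin/end markers through the job configuration instead.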
I did indeed see this, after I wrote my own record reader :)

-------------------------------------------------------------------
Rajarshi Guha <rg...@indiana.edu>
GPG Fingerprint: D070 5427 CC5B 7938 929C DD13 66A1 922C 51E7 9E84
-------------------------------------------------------------------
Q: What's polite and works for the phone company?
A: A deferential operator.