Yes, it's entirely possible for part of a record to land in one file split and the rest in the next. It's the RecordReader's job to make sure it always reads whole records. Given a file split, your RecordReader has to be able to skip over the first few bytes to get to the first full record (if the split begins with a partial record). When it reaches the end of the split and finds a partial record there, it goes and gets the rest of the record from the next split.
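To make the rule concrete, here's a minimal, self-contained simulation of that logic (not Hadoop's actual API; class and method names are illustrative). It uses newline-delimited records over a byte array: a reader whose split doesn't start at byte 0 skips ahead to the first record boundary, and every reader finishes the record it started even when its bytes run past the end of the split. Together the two rules guarantee each record is read exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the LineRecordReader split-boundary convention.
public class SplitDemo {
    // Read the records "owned" by the split [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Rule 1: a non-first split skips the partial record at its front;
        // the previous split's reader is responsible for that record.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the delimiter
        }
        // Rule 2: any record that *begins* before `end` belongs to this
        // split, so keep reading even if its bytes cross the boundary.
        while (pos < end && pos < data.length) {
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, recStart, pos - recStart));
            pos++; // step past the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes();
        // A split boundary at byte 8 falls in the middle of "bravo".
        System.out.println(readSplit(data, 0, 8));           // [alpha, bravo]
        System.out.println(readSplit(data, 8, data.length)); // [charlie, delta]
    }
}
```

Note that the first reader returns "bravo" in full even though the split ends at byte 8, and the second reader starts cleanly at "charlie"; in real Hadoop this works because the filesystem stream lets a reader continue past its split's end.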
Tom's email earlier in this thread explained some of the details. Like he said, look at LineRecordReader for inspiration. The logic for figuring out the start of the first full record is in LineRecordReader itself. The RecordReader can read the last record (the one that spans two file splits) without any special logic, because the Hadoop filesystem abstracts away file split boundaries when reading.

On Mon, Jun 1, 2009 at 8:05 PM, Yabo-Arber Xu <[email protected]> wrote:

> I have a follow-up question on this thread: How do we make sure that at the
> getFileSplit phase, there are no records that cross the boundary of
> different file splits?
>
> To explain my point better: for example, if each of my records is 100 bytes,
> could there be a case where some record's key was put in the first file
> split, while its value was put in the second split?
>
> Best,
> Arber
>
> On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley <[email protected]> wrote:
>
> > On May 28, 2009, at 5:15 AM, Stuart White wrote:
> >
> > > I need to process a dataset that contains text records of fixed length
> > > in bytes. For example, each record may be 100 bytes in length.
> >
> > The update to the terasort example has an InputFormat that does exactly
> > that. The key is 10 bytes and the value is the next 90 bytes. It is
> > pretty easy to write, but I should upload it soon. The output types are
> > Text, but they just have the binary data in them.
> >
> > -- Owen
