Will,
I had to write a custom binary input format and reader for a proprietary
file format. It is something of a pain, but I believe you could write a
splittable version for what you are describing. The basics are that the
reader must keep checking whether it is still reading inside its split,
and it needs a heuristic for reading over the split boundary.

For instance, in my setup, if the split started at 0 (beginning of
file), the reader began reading from offset 0. In all other cases, I
had the reader align its "read head" to an object boundary (past the
split offset) and then begin reading. If an object lay over into the
next split, the reader would read to the end of that object and stop.
The next split was responsible for knowing that it should not start at
its arbitrary offset, but should move the "read head" to the first
object boundary and begin working from there.

Just remember to think of it from the perspective of the map function
asking for the next arbitrary value (eventually calling 'public
synchronized boolean next(Writable key)' on your reader). As long as
that returns valid data and the underlying mechanics maintain split
boundaries across objects, you should be OK. It is a bit of a pain
making sure all of these criteria are met, but I'd be willing to bet
you can do it with various binary formats if you are so inclined.
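
A rough sketch of that alignment rule, applied to the [4-byte
length][payload] layout from the original question, might look like the
following. It is plain Java over an in-memory byte array, not a real
Hadoop RecordReader, and the names (SplitAlignedReader, align,
readSplit, encode) are made up for illustration. It also assumes two
things the thread does not state: the length prefix is a big-endian
int, and the format has no sync marker, so a record boundary is found
by hopping from header to header starting at offset 0.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Hadoop API.
class SplitAlignedReader {

    // Returns the first record boundary at or past splitStart.
    // Assumption: with no sync marker, boundaries can only be found by
    // walking the length prefixes from the start of the file.
    static int align(ByteBuffer file, int splitStart) {
        int pos = 0;
        while (pos < splitStart && pos + 4 <= file.limit()) {
            pos += 4 + file.getInt(pos); // skip header plus payload
        }
        return pos;
    }

    // Reads every record whose header starts inside [splitStart, splitEnd).
    // The last record may extend past splitEnd; it is still read in full.
    static List<byte[]> readSplit(byte[] data, int splitStart, int splitLength) {
        ByteBuffer file = ByteBuffer.wrap(data); // big-endian by default
        int splitEnd = splitStart + splitLength;
        int pos = align(file, splitStart);
        List<byte[]> records = new ArrayList<>();
        while (pos < splitEnd && pos + 4 <= file.limit()) {
            int length = file.getInt(pos);
            byte[] payload = new byte[length];
            file.position(pos + 4);
            file.get(payload); // throws if the file is truncated mid-record
            records.add(payload);
            pos += 4 + length;
        }
        return records;
    }

    // Builds a test "file" of length-prefixed records.
    static byte[] encode(String... payloads) {
        int total = 0;
        for (String p : payloads) {
            total += 4 + p.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        for (String p : payloads) {
            byte[] y = p.getBytes(StandardCharsets.UTF_8);
            buf.putInt(y.length).put(y);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] data = encode("aa", "bbbb", "c", "dddddd"); // 29 bytes
        int splitSize = 10;
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            StringBuilder line = new StringBuilder("split@" + start + ":");
            for (byte[] record : readSplit(data, start, length)) {
                line.append(' ').append(new String(record, StandardCharsets.UTF_8));
            }
            System.out.println(line);
        }
    }
}
```

With these four records and 10-byte splits, the first split reads "aa"
and "bbbb" (running past its boundary to finish "bbbb"), the second
reads "c" and "dddddd", and the third reads nothing, so each record is
read exactly once. Walking the headers from offset 0 costs one small
read per preceding record; a format that embeds a sync marker, as
Hadoop's SequenceFile does, can scan forward from the split offset
instead.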

Josh Patterson 

-----Original Message-----
From: william kinney [mailto:[email protected]] 
Sent: Wednesday, June 24, 2009 12:17 PM
To: [email protected]
Subject: Custom Binary FileInputFormat, splitting

Hi,

I have binary files in HDFS that I am creating an InputFormat (and
RecordReader) for. The binary format is something like [X of length 4
bytes][Y of X size], where X evaluates to an int, and the pattern
continues as XYXYXYXY. I use X (size) to know the length of the next
record to read (Y).
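
For reference, a sequential read of that layout could look roughly
like this. It is only a sketch: LengthPrefixedRecords and its methods
are hypothetical names, and it assumes X is a big-endian signed int (as
written by java.io.DataOutputStream.writeInt), which is not specified
above.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of any Hadoop API.
class LengthPrefixedRecords {

    // Returns the next Y payload, or null once the input is exhausted.
    // EOF while reading the 4-byte header (including a truncated header)
    // is treated as end of input; EOF inside a payload propagates.
    static byte[] next(DataInputStream in) throws IOException {
        int x;
        try {
            x = in.readInt(); // assumption: X is big-endian
        } catch (EOFException endOfInput) {
            return null;
        }
        byte[] y = new byte[x];
        in.readFully(y);
        return y;
    }

    // Convenience wrapper that reads every record from an in-memory file.
    static List<byte[]> readAll(byte[] file) {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            for (byte[] y = next(in); y != null; y = next(in)) {
                records.add(y);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Two records: X=2 "hi", then X=3 "abc".
        byte[] file = {0, 0, 0, 2, 'h', 'i', 0, 0, 0, 3, 'a', 'b', 'c'};
        for (byte[] y : readAll(file)) {
            System.out.println(new String(y, StandardCharsets.US_ASCII));
        }
    }
}
```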

Does that mean I then cannot support isSplitable() == true because the
records are variable length?

Are there any tips or best practices in reading in binary file formats?

Thanks,
Will
