Will,

I had to write a custom binary input format and reader for a proprietary file format. It is a bit of a pain, but I believe you could write a splittable version for what you are describing. The basics are that the reader must always keep track of whether it is still reading inside its own split, and it needs a rule for reading over the split boundary.

For instance, in my setup, if the split started at 0 (the beginning of the file), the reader began reading from offset 0. In all other cases, I had the reader align its "read head" to the first object boundary past the split offset and begin reading there. If an object extended into the next split, the reader would read to the end of that object and then stop. The reader for the next split was responsible for knowing that it should not start at its arbitrary offset, but should move the "read head" to the first object boundary and begin working from there.

Just remember to think of it from the perspective of the map function asking for the next value (eventually calling 'public synchronized boolean next( Writable key)' on your reader). As long as that call returns valid data and the underlying mechanics correctly handle objects that cross split boundaries, you should be OK. It takes some care to make sure all of these criteria are met, but I'd be willing to bet you can do it with various binary formats if you are so inclined.
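To make the boundary rule concrete, here is a minimal sketch in plain Java (no Hadoop classes) for a [4-byte length][payload] layout like the one in the original message below. Several things are assumptions for illustration: the class and method names are hypothetical, the length is treated as a big-endian int, the payloads are treated as UTF-8 text, and the whole file is held in a byte array. Because this layout has no sync markers, the sketch finds the first record boundary by walking the length headers from offset 0, which may well differ from how a proprietary format would locate boundaries.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop.
class LengthPrefixedSplitReader {

    static final int HEADER_BYTES = 4; // X: record length, assumed big-endian int

    // Builds an in-memory "file" of [length][payload] records for the demo.
    static byte[] encode(String... payloads) {
        int total = 0;
        for (String p : payloads) {
            total += HEADER_BYTES + p.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer out = ByteBuffer.allocate(total);
        for (String p : payloads) {
            byte[] body = p.getBytes(StandardCharsets.UTF_8);
            out.putInt(body.length).put(body);
        }
        return out.array();
    }

    // Reads the length header at pos and rejects corrupt or truncated records.
    static int recordLength(ByteBuffer buf, int pos) {
        int len = buf.getInt(pos);
        if (len < 0 || (long) pos + HEADER_BYTES + len > buf.limit()) {
            throw new IllegalStateException("Corrupt or truncated record at offset " + pos);
        }
        return len;
    }

    // Returns the first record boundary at or after splitStart by hopping
    // from header to header, starting at the beginning of the file.
    static int alignToRecordBoundary(byte[] file, int splitStart) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int pos = 0;
        while (pos < splitStart && pos + HEADER_BYTES <= file.length) {
            pos += HEADER_BYTES + recordLength(buf, pos);
        }
        return pos;
    }

    // Returns every record whose header starts inside [splitStart, splitStart + splitLength).
    // A record that begins in this split is read to its end, even past the split boundary.
    static List<String> readSplit(byte[] file, int splitStart, int splitLength) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int splitEnd = Math.min(splitStart + splitLength, file.length);
        int pos = alignToRecordBoundary(file, splitStart);
        List<String> records = new ArrayList<>();
        while (pos < splitEnd && pos + HEADER_BYTES <= file.length) {
            int len = recordLength(buf, pos);
            int bodyStart = pos + HEADER_BYTES;
            records.add(new String(file, bodyStart, len, StandardCharsets.UTF_8));
            pos = bodyStart + len;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = encode("a", "bcd", "efghij"); // 22 bytes in total
        for (int start = 0; start < file.length; start += 8) {
            System.out.println("split@" + start + " -> " + readSplit(file, start, 8));
        }
    }
}
```

The rule that a record belongs to the split in which its header starts is what keeps each record from being read twice or skipped. In a real reader the offsets would be long values read from an HDFS stream, and walking the headers from offset 0 costs one small read per preceding record, so a format with sync markers or a separate index of record offsets would likely align faster.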
Josh Patterson

-----Original Message-----
From: william kinney [mailto:[email protected]]
Sent: Wednesday, June 24, 2009 12:17 PM
To: [email protected]
Subject: Custom Binary FileInputFormat, splitting

Hi,

I have binary files in the HDFS that I am creating an InputFormat (and RecordReader) for. The binary format is something like [X of length 4 bytes][Y of X size], where X evaluates to an int, and the pattern continues as XYXYXYXY. I use X (size) to know the length of the next record to read (Y). Does that mean I then cannot support isSplitable() == true because the records are variable length?

Are there any tips or best practices in reading in binary file formats?

Thanks,
Will
