Hi William,
With that kind of file format, you have a couple of options:
The first option is, as you figured, to just not have splittable files. If
your dataset is already split into a lot of separate files, each of which
can be processed in a reasonable amount of time by a single task, that
solution is certainly the easiest and will work correctly.
The second option is to preprocess the file to generate some kind of
secondary file used for generating splits. Something like the following
pseudocode:
rec_num = 0
while not eof(file):
    if rec_num % 1000 == 0:
        print rec_num, file.tell()   # offset of the start of this record
    len = file.readInt()
    file.skip(len)
    rec_num += 1
This secondary file will then have the file offset of every 1000th record,
each of which is a valid place to start a split.
You can use this to compute InputSplits from your InputFormat.
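In Java, that indexing pass might look something like the sketch below. It
uses plain java.io against a local file for illustration; Hadoop's
FSDataInputStream offers the same readInt/skipBytes calls, so the same loop
would work against HDFS. The class name and stride parameter are just for
this example:

```java
import java.io.*;

public class RecordIndexer {
    // Scan an XYXY... file and print "recordNumber offset" for every
    // stride-th record; these offsets are valid split start points.
    public static void index(File in, PrintStream out, int stride) throws IOException {
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(new FileInputStream(in)))) {
            long offset = 0;   // byte offset of the start of the current record
            long recNum = 0;
            while (true) {
                int len;
                try {
                    len = dis.readInt();       // X: 4-byte record length
                } catch (EOFException e) {
                    break;                     // clean end of file
                }
                if (recNum % stride == 0) {
                    out.println(recNum + " " + offset);
                }
                dis.skipBytes(len);            // skip over Y
                offset += 4 + len;
                recNum++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: write three length-prefixed records, then index every record.
        File f = File.createTempFile("records", ".bin");
        f.deleteOnExit();
        try (DataOutputStream dos = new DataOutputStream(new FileOutputStream(f))) {
            for (int len : new int[] {5, 10, 3}) {
                dos.writeInt(len);
                dos.write(new byte[len]);
            }
        }
        index(f, System.out, 1);   // stride 1 so the demo prints every record
    }
}
```

With a stride of 1000 you get one index entry per 1000 records, as in the
pseudocode above.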
The last option is to modify your file format to pad to chunks. For example,
whenever you are about to write a record which would cross a 1MB boundary,
instead pad the file with 0s up to the 1MB block offset. This is slightly
tricky if you have some records which may be larger than 1MB, and of course
you'll need to edit your file format. If you're going this route, you might
be better off simply converting your files into SequenceFile format using
NullWritable keys and BytesWritable values. You can then simply use
SequenceFileInputFormat and not worry about computing splits, etc.
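If you do roll your own padding rather than switch to SequenceFile, the
writer side might be sketched like this. The class and the chunk-size
parameter are hypothetical (not part of any Hadoop API); the chunk size is
a constructor argument so the idea can be demonstrated with something
smaller than 1MB:

```java
import java.io.*;

public class PaddedRecordWriter {
    private final DataOutputStream out;
    private final long chunkSize;   // e.g. 1 << 20 for 1MB chunks
    private long pos = 0;           // current byte offset in the file

    public PaddedRecordWriter(OutputStream out, long chunkSize) {
        this.out = new DataOutputStream(out);
        this.chunkSize = chunkSize;
    }

    // Write one [length][data] record; if it would cross a chunk
    // boundary, first pad with zeros up to the next chunk offset.
    public void write(byte[] record) throws IOException {
        long needed = 4 + record.length;
        if (needed > chunkSize)
            throw new IllegalArgumentException("record larger than chunk");
        long remaining = chunkSize - (pos % chunkSize);
        if (needed > remaining) {
            out.write(new byte[(int) remaining]);  // zero padding
            pos += remaining;
        }
        out.writeInt(record.length);
        out.write(record);
        pos += needed;
    }
}
```

The matching RecordReader would start at a chunk boundary and, on reading a
length of 0, skip ahead to the next chunk boundary — which also means this
scheme only works if genuine zero-length records can't occur.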
Hope that helps,
-Todd
On Wed, Jun 24, 2009 at 5:16 PM, william kinney <[email protected]> wrote:
> Hi,
>
> I have binary files in the HDFS that I am creating a InputFormat (and
> RecordReader) for. The binary format is something like [X of length 4
> bytes][Y of X size], where X evaluates to an int, and the pattern
> continues as XYXYXYXY. I use X (size) to know the length of the next
> record to read (Y).
>
> Does that mean I then cannot support isSplitable() == true because the
> records are variable length?
>
> Are there any tips or best practices in reading in binary file formats?
>
> Thanks,
> Will
>