I am assuming you've already implemented a custom record reader for your file and are now only wondering how to handle record boundaries. If so, please read the map section here: http://wiki.apache.org/hadoop/HadoopMapReduce, which explains how MR handles this for text files, whose records are read up to a \n. In your case, you ought to divide the file, based on its length, into chunks of 180 bytes each. Then, from each block's offset and length, you can determine the start/end points of every record; that is what Kai was getting at.
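On the record reader side, here is a minimal sketch of what such a fixed-length reader could look like. This assumes the new org.apache.hadoop.mapreduce API, a hard-coded 180-byte record length, and LongWritable offsets / BytesWritable records as key/value types; the class name FixedLengthRecordReader is only illustrative, not something from your code.

// Minimal sketch: a RecordReader for fixed 180-byte records (illustrative
// names; assumes the file length is an exact multiple of 180, as described).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedLengthRecordReader extends RecordReader<LongWritable, BytesWritable> {

  private static final int RECORD_LENGTH = 180;

  private FSDataInputStream in;
  private long start;   // first byte this reader owns
  private long end;     // first byte it does NOT own
  private long pos;     // offset of the next record to read
  private final LongWritable key = new LongWritable();
  private final BytesWritable value = new BytesWritable();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);

    start = split.getStart();
    end = start + split.getLength();

    // If getSplits() already cuts on 180-byte multiples, start is aligned and
    // this is a no-op. Otherwise, skip forward to the next record boundary;
    // the previous split's reader owns the record we skip over.
    long remainder = start % RECORD_LENGTH;
    if (remainder != 0) {
      start += RECORD_LENGTH - remainder;
    }
    pos = start;

    in = fs.open(file);   // positioned reads below, so no seek needed
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (pos >= end) {
      return false;                // next record belongs to the following split
    }
    byte[] record = new byte[RECORD_LENGTH];
    in.readFully(pos, record);     // positioned read of one whole record,
                                   // even if it runs past the split end
    key.set(pos);                  // key = byte offset of the record
    value.set(record, 0, RECORD_LENGTH);
    pos += RECORD_LENGTH;
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() {
    return end == start ? 1.0f : (pos - start) / (float) (end - start);
  }

  @Override
  public void close() throws IOException {
    if (in != null) {
      in.close();
    }
  }
}

If getSplits() already ends every split on a 180-byte multiple (as in the example below), the alignment step in initialize() never triggers; it is only a safety net for unaligned splits, in which case a record straddling two splits is read entirely by the split that owns its first byte.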
For example, let's assume a block size of 64 MB (67108864 bytes). Let's also assume each record, right from the start of your file, is always 180 bytes. Therefore, if we cut the file into splits whose lengths are multiples of 180, your problem immediately goes away. So your InputFormat#getSplits() can perhaps do the following:

1. Split the file by block size first. Let's assume we get two full 64 MB cuts and one tail cut of 52 MB (totaling a file size of 180 MB, assuming we have 1024*1024 records in it).
2. Then tweak the FileSplits to instead end at proper 180-multiples (use the modulo operator to find good boundaries):
   - First FileSplit: starts at 0, ends at 67109040, so that it holds 372828 full records. That end is 176 bytes past the first HDFS block boundary.
   - Second FileSplit: starts, obviously, at 67109040 and ends at 134217900, which is 172 bytes past the second block's 64 MB boundary. This one again holds only full records, 372827 of them.
   - Last FileSplit: starts at 134217900 and runs to EOF, so it automatically contains all the full records remaining to be read.

Does this make sense? (A rough sketch of such a getSplits() follows at the end, below the quoted thread.)

On Fri, Jul 6, 2012 at 1:31 AM, MJ Sam <mikesam...@gmail.com> wrote:
> By block size, do you mean the HDFS block size, the split size, or my
> record size? The problem is that, given a split, how do I make my
> record reader find where my records start in the stream handed to the
> mapper when there is no record start tag? Would you please explain a
> bit more what you mean?
>
> On Thu, Jul 5, 2012 at 11:57 AM, Kai Voigt <k...@123.org> wrote:
>> Hi,
>>
>> if you know the block size, you can calculate the offsets for your records.
>> And write a custom record reader class to seek into your records.
>>
>> Kai
>>
>> On 05.07.2012, at 22:54, MJ Sam wrote:
>>
>>> Hi,
>>>
>>> The input of my map reduce is a binary file with no record begin or
>>> end markers. The only thing is that each record is a fixed 180 bytes
>>> in size in the binary file. How do I make Hadoop properly find the
>>> records in the splits when a record overlaps two splits? I was
>>> thinking of making the split size a multiple of 180, but was
>>> wondering if there is anything else I can do. Please note that my
>>> files are not sequence files, just custom binary files.
>>>
>>
>> --
>> Kai Voigt
>> k...@123.org
>>
>>
>>

--
Harsh J
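As mentioned above, here is a rough sketch of a getSplits() that ends every split on a 180-byte multiple, matching the numbers in the walkthrough. Again this assumes the new API; FixedLengthInputFormat is an illustrative name, and the empty host array ignores data locality, which a real implementation would fill in from the file's block locations.

// Rough sketch: an InputFormat whose getSplits() rounds each block boundary
// up to the next 180-byte multiple (67108864 -> 67109040, 134217728 -> 134217900).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedLengthInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

  private static final long RECORD_LENGTH = 180;

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {
      Path path = file.getPath();
      long length = file.getLen();           // assumed to be a multiple of 180
      long blockSize = file.getBlockSize();  // e.g. 64 MB = 67108864

      long start = 0;
      long blockEnd = blockSize;
      while (start < length) {
        // Round the raw HDFS block boundary up to the next 180-byte multiple.
        long end = blockEnd;
        long remainder = end % RECORD_LENGTH;
        if (remainder != 0) {
          end += RECORD_LENGTH - remainder;
        }
        if (end >= length) {
          end = length;                      // tail split runs to EOF
        }
        // Empty host array for brevity; a real implementation would pass the
        // hosts of the underlying block for locality.
        splits.add(new FileSplit(path, start, end - start, new String[0]));
        start = end;
        blockEnd += blockSize;
      }
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new FixedLengthRecordReader();    // the reader sketched earlier
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return true;   // safe, because every split now ends on a record boundary
  }
}

With splits cut this way, each mapper reads only whole records, so the "record overlaps two splits" case never occurs and the reader never has to resynchronize.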