And one more question to pile on:

What defines the binary data that the raw tuple comparator is run on?
It seems like it comes from Hadoop, and the format generally makes
sense (you get bytes and do with them what you will). What
confuses me is why you don't have to deal with the
RECORD_1/RECORD_2/RECORD_3 hoopla. InterStorage deals with all of that
and reads a deserialized tuple, so at what point do you get binary Tuple
data that doesn't have all of the split stuff? I'll keep digging,
but this is where my ignorance of the technicalities of the MR layer comes
in...
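
To make the question concrete, here is roughly what I picture by "you get
bytes and do with them what you will". This is only a sketch under my own
assumptions: the class name RawBytesCompareSketch is made up, the method
signature just mirrors the shape of Hadoop's RawComparator.compare (buffer,
start offset, length for each side), and plain unsigned byte ordering is a
stand-in, not a claim about what Pig's raw tuple comparator actually does.

```java
// Sketch only: a standalone class shaped like Hadoop's RawComparator
// contract. Nothing here is taken from Pig's source.
final class RawBytesCompareSketch {

    // Compares two serialized records lexicographically as unsigned bytes,
    // without deserializing either one. Each record is described by the
    // buffer holding it, its start offset, and its length in bytes.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff; // mask so bytes compare as 0..255
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal, so the shorter record sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        // Two 3-byte "records" packed back to back in one buffer
        // (hypothetical data, no sync markers in between).
        byte[] buf = {0x01, 0x02, (byte) 0x80, 0x01, 0x02, 0x03};
        int result = compare(buf, 0, 3, buf, 3, 3);
        System.out.println(result > 0 ? "first sorts after second"
                                      : "first sorts before or equal to second");
    }
}
```

Note that the comparator is handed exact offsets and lengths, so it never has
to locate record boundaries itself. That is the part I don't understand yet:
where those clean boundaries come from.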

2012/5/23 Jonathan Coveney <[email protected]>

> Another question is clarifying what BinStorage does compared to
> InterStorage. It looks like it might just be a legacy storage format?
>
> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next Tuple
> in the stream, but once you do that, can't you just read a tuple, then
> skip 12 bytes (3 ints), and keep reading?
>
>
> 2012/5/23 Jonathan Coveney <[email protected]>
>
>> I'm trying to understand how intermediate serialization in Pig works at a
>> deeper level (understanding the whole code path, not just BinInterSedes in
>> its own vacuum). Right now I am looking at
>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right place
>> to look for understanding how BinInterSedes is actually called?
>>
>> Further, I'm trying to better understand the
>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the file
>> splittable? But I'm not really sure. I'd love any pointers about where to
>> look for how BinInterSedes is used, and how intermediate storage happens.
>>
>> Thanks!
>> Jon
>>
>
>
