Re: Some questions on intermediate serialization in Pig

Gianmarco De Francisci Morales Sat, 26 May 2012 13:15:36 -0700

I am not sure, but I will have a look at it (I implemented the raw
comparator for secondary sort).
I don't remember having to deal with this issue.


Cheers,
--
Gianmarco




On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <[email protected]> wrote:

> I'll just bump this once. The main thing I'm still unsure about is the
> relationship between the various raw comparators, Pig, and Hadoop. If we're serializing
> RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3,
> Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come it appears
> that the raw comparators aren't aware of it?
>
> 2012/5/23 Jonathan Coveney <[email protected]>
>
> > And one more question to pile on:
> >
> > What defines the binary data that the raw tuple comparator will be run on?
> > It seems like it comes from Hadoop, and the format generally makes
> > sense (you get bytes and do with them what you will). The thing that
> > confuses me is why you don't have to deal with the
> > RECORD_1/RECORD_2/RECORD_3/etc. hoopla. InterStorage deals with all of that
> > and reads a deserialized tuple... so at what point do you get binary Tuple
> > data that doesn't have all of the split stuff? I'll keep digging through,
> > but this is where my ignorance of the technicalities of the MR layer comes
> > in...
> >
> > 2012/5/23 Jonathan Coveney <[email protected]>
> >
> >> Another question is clarifying what BinStorage does compared to
> >> InterStorage. It looks like it might just be a legacy storage format?
> >>
> >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
> >> Tuple in the stream, but once you do that, can't you just read a tuple,
> >> then skip 12 bytes (3 ints), and keep reading?
> >>
> >>
> >> 2012/5/23 Jonathan Coveney <[email protected]>
> >>
> >>> I'm trying to understand how intermediate serialization in Pig works at
> >>> a deeper level (understanding the whole code path, not just BinInterSedes
> >>> in its own vacuum). Right now I am looking at
> >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right place
> >>> to look for understanding how BinInterSedes is actually called?
> >>>
> >>> Further, I'm trying to better understand the
> >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the file
> >>> splittable? But I'm not really sure. I'd love any pointers about where to
> >>> look for how BinInterSedes is used, and how intermediate storage happens.
> >>>
> >>> Thanks!
> >>> Jon
> >>>
> >>
> >>
> >
>
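
For illustration only, here is a minimal sketch of the guess raised in the
thread: that a RECORD_1/RECORD_2/RECORD_3 style marker lets a reader dropped
at an arbitrary offset find the next record boundary. It is not Pig's actual
code. The class name, method name, marker values, and one-byte marker width
are all assumptions made up for this example.

```java
class SyncMarkerScan {
    // Hypothetical marker bytes, chosen only for this sketch.
    static final byte[] MARKER = {0x01, 0x02, 0x03};

    /**
     * Returns the offset of the first byte after the next complete marker
     * found at or after 'from', or -1 if no complete marker remains.
     */
    static int nextRecordStart(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + MARKER.length <= data.length; i++) {
            if (data[i] == MARKER[0]
                    && data[i + 1] == MARKER[1]
                    && data[i + 2] == MARKER[2]) {
                return i + MARKER.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two leading junk bytes, then marker + payload, then marker + payload.
        byte[] stream = {9, 9, 0x01, 0x02, 0x03, 42, 43, 0x01, 0x02, 0x03, 44};
        System.out.println(nextRecordStart(stream, 0)); // first payload offset
        System.out.println(nextRecordStart(stream, 6)); // resync from mid-record
    }
}
```

Note that a scan like this can be fooled if the payload happens to contain
the marker bytes, so a real format needs some way to cope with that; the
sketch ignores the problem.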
