I am not sure, but I will have a look at it (I implemented the raw comparator for secondary sort). I don't remember having to deal with this issue.
Cheers,
--
Gianmarco

On Fri, May 25, 2012 at 11:07 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> I'll just bump this once. The main thing I'm still unsure on is just the
> relationship between the various raw comparators, Pig, and Hadoop. If we're
> serializing RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2,
> RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, RECORD_1, RECORD_2,
> RECORD_3, Tuple, RECORD_1, RECORD_2, RECORD_3, Tuple, and so on, how come it
> appears that the raw comparators aren't aware of it?
>
> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
>
> > And one more question to pile on:
> >
> > What defines the binary data that the raw tuple comparator will be run on?
> > It seems like it comes from Hadoop, and the format generally makes
> > sense (you get bytes and do with them what you will). The thing that
> > confuses me is why don't you have to deal with the
> > RECORD_1/RECORD_2/RECORD_3/etc hoopla? InterStorage deals with all of that
> > and reads a deserialized tuple...so at what point do you get binary Tuple
> > data that doesn't have all of the split stuff? I'll keep digging through
> > but this is where my ignorance of the technicalities of the MR layer comes
> > in...
> >
> > 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> >
> >> Another question is clarifying what BinStorage does compared to
> >> InterStorage. It looks like it might just be a legacy storage format?
> >>
> >> I'm assuming that you do the R_1/R_2/R_3 to be able to find the next
> >> Tuple in the stream, but once you do that, can't you just read a tuple,
> >> then skip 12 bytes (3 ints), and keep reading?
> >>
> >> 2012/5/23 Jonathan Coveney <jcove...@gmail.com>
> >>
> >>> I'm trying to understand how intermediate serialization in Pig works at
> >>> a deeper level (understanding the whole code path, not just BinInterSedes
> >>> in its own vacuum). Right now I am looking at
> >>> InterRecordReader/InterRecordWriter/InterStorage. Is that the right place
> >>> to look for understanding how BinInterSedes is actually called?
> >>>
> >>> Further, I'm trying to better understand the
> >>> RECORD_1/RECORD_2/RECORD_3 thing. My guess is that it's to make the file
> >>> splittable? But I'm not really sure. I'd love any pointers about where to
> >>> look for how BinInterSedes is used, and how intermediate storage happens.
> >>>
> >>> Thanks!
> >>> Jon
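
On the splittability guess in the quoted thread, here is a minimal sketch of the general idea of record markers, not Pig's actual on-disk format. Everything in it is an assumption made for illustration: the marker bytes in SYNC, the 4-byte length prefix, and the helper names write_records and read_split are all hypothetical. It only shows why a reader that starts at an arbitrary byte offset can scan forward to the next marker and pick up whole records from there.

```python
import io
import struct

# Hypothetical 3-byte marker standing in for RECORD_1/RECORD_2/RECORD_3.
# The real values and framing used by Pig may differ.
SYNC = bytes([0x01, 0x02, 0x03])


def write_records(payloads):
    """Serialize each payload as SYNC + 4-byte big-endian length + payload."""
    out = io.BytesIO()
    for payload in payloads:
        out.write(SYNC)
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)
    return out.getvalue()


def read_split(data, start, end):
    """Return the payloads of records whose marker begins in [start, end).

    The reader may be dropped at any offset, so it first scans forward
    to the next marker. A record that starts inside the split is read in
    full even if it runs past `end`; a record that starts before `start`
    is left to the previous split.
    """
    records = []
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        body = pos + len(SYNC)
        (length,) = struct.unpack_from(">I", data, body)
        body += 4
        records.append(data[body:body + length])
        # Once aligned, the next marker sits right after this payload.
        pos = data.find(SYNC, body + length)
    return records
```

With this framing, two readers given adjacent byte ranges each return a disjoint set of records, and together they cover the file. The sketch ignores the case where the marker bytes happen to occur inside a payload or a length prefix, which a real format has to handle in some way.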