Thanks Thejas!

2010/9/10 Thejas M Nair <te...@yahoo-inc.com>
> Yes, Zebra has a columnar storage format.
> Regarding selective deserialization (i.e. deserializing only the columns
> that are actually needed for the pig query) - as per my understanding,
> elephant-bird has a protocol-buffer-based loader which does lazy
> deserialization. PigStorage also does something similar: when PigStorage
> is used to load data, it returns bytearray types, and Pig adds a
> type-casting foreach after the load which does the type conversion only
> on the fields that are required in the rest of the query.
>
> -Thejas
>
>
> On 9/3/10 8:05 PM, "Renato Marroquín Mogrovejo"
> <renatoj.marroq...@gmail.com> wrote:
>
>> Thanks Dmitriy! Hey, a couple of final questions please.
>> Which are the deserializers that implement this selective
>> deserialization?
>> And is the columnar storage used Zebra?
>> Thanks again for the great replies.
>>
>> Renato M.
>>
>> 2010/9/2 Dmitriy Ryaboy <dvrya...@gmail.com>
>>
>>> Pig has selective deserialization and columnar storage if the loader
>>> you are using implements it. So that depends on what you are doing.
>>> Naturally, if your data is not stored in a way that separates the
>>> columns, Pig can't magically read them separately :).
>>>
>>> You should try to always use combiners.
>>>
>>> -D
>>>
>>>
>>> On Thu, Sep 2, 2010 at 2:51 PM, Renato Marroquín Mogrovejo <
>>> renatoj.marroq...@gmail.com> wrote:
>>>
>>>> So in terms of performance it is the same whether I count just a
>>>> single column or the whole data set, right?
>>>> But what Thejas said about the loader having optimizations (selective
>>>> deserialization or columnar storage) - is that something Pig actually
>>>> has, or is it something planned for the future?
>>>> And hey, isn't using a combiner something we should try to avoid? I
>>>> mean, for the COUNT case a combiner is needed, but are there any
>>>> other operations that are put into that combiner? Like trying to
>>>> reuse the computation being made?
>>>> Thanks for the replies (=
>>>>
>>>> Renato M.
>>>>
>>>>
>>>> 2010/8/29 Mridul Muralidharan <mrid...@yahoo-inc.com>
>>>>
>>>>> The reason why COUNT(a.field1) would have better performance is that
>>>>> Pig does not 'know' what is required from a tuple in the case of
>>>>> COUNT(a). In a custom mapred job we can optimize it away so that
>>>>> only the single required field is projected out, but that is
>>>>> obviously not possible here (COUNT is a udf), so the entire tuple is
>>>>> deserialized from input.
>>>>>
>>>>> Of course, the performance difference, as Dmitriy noted, would not
>>>>> be very high.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>>
>>>>> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
>>>>>
>>>>>> Hi, this is also interesting and kinda confusing for me too (=
>>>>>> Coming from the db world, the second one would have better
>>>>>> performance, but Pig doesn't keep statistics on the data, so it has
>>>>>> to read the whole file anyway. And since the count operation is
>>>>>> mainly done on the map side, all attributes will be read anyway,
>>>>>> but the ones that are not interesting for us will be dismissed and
>>>>>> not passed to the reducer part of the job. Besides, wouldn't the
>>>>>> presence of null values affect the performance? For example, if a2
>>>>>> had many null values, then fewer values would be passed too, right?
>>>>>>
>>>>>> Renato M.
>>>>>>
>>>>>> 2010/8/27 Mridul Muralidharan <mrid...@yahoo-inc.com>
>>>>>>
>>>>>>> On second thoughts, that part is obvious - duh
>>>>>>>
>>>>>>> - Mridul
>>>>>>>
>>>>>>>
>>>>>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
>>>>>>>
>>>>>>>> But it does for COUNT(A.a2)?
>>>>>>>> That is interesting, and somehow weird :)
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Mridul
>>>>>>>>
>>>>>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
>>>>>>>>
>>>>>>>>> I think if you do COUNT(A), Pig will not realize it can ignore
>>>>>>>>> a2 and a3, and will project all of them.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
>>>>>>>>> <mrid...@yahoo-inc.com> wrote:
>>>>>>>>>
>>>>>>>>>> I am not sure why the second option is better - in both cases,
>>>>>>>>>> you are shipping only the combined counts from map to reduce.
>>>>>>>>>> On the other hand, the first could be better since it means we
>>>>>>>>>> need to project only 'a1' - and none of the other fields.
>>>>>>>>>>
>>>>>>>>>> Or did I miss something here?
>>>>>>>>>> I am not very familiar with what Pig does in this case right
>>>>>>>>>> now.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Mridul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
>>>>>>>>>>
>>>>>>>>>>> Generally speaking, the second option will be more performant,
>>>>>>>>>>> as it might let you drop column a3 early. In most cases the
>>>>>>>>>>> magnitude of this is likely to be very small, as COUNT is an
>>>>>>>>>>> algebraic function, so most of the work is done map-side
>>>>>>>>>>> anyway, and only partial, pre-aggregated counts are shipped
>>>>>>>>>>> from mappers to reducers. However, if A is very wide, or a
>>>>>>>>>>> column store, or has a non-negligible deserialization cost
>>>>>>>>>>> that can be offset by only deserializing a few fields -- the
>>>>>>>>>>> second option is better.
>>>>>>>>>>>
>>>>>>>>>>> -D
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes
>>>>>>>>>>> <cor...@tynt.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Wondering about performance and count...
>>>>>>>>>>>> A = load 'test.csv' as (a1,a2,a3);
>>>>>>>>>>>> B = GROUP A by a1;
>>>>>>>>>>>> -- which is preferred?
>>>>>>>>>>>> C = FOREACH B GENERATE COUNT(A);
>>>>>>>>>>>> -- or would this only send a single field through the COUNT
>>>>>>>>>>>> -- and be more performant?
>>>>>>>>>>>> C = FOREACH B GENERATE COUNT(A.a2);
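
To make the thread's advice concrete, here is a minimal Pig Latin sketch
built around Corbin's original script. The early-projection variant
(A1/B1/C3) and the null-handling caveat are illustrative additions, not
something stated verbatim above, so treat them as assumptions to verify
against your Pig version and loader.

A = load 'test.csv' as (a1, a2, a3);
B = GROUP A by a1;

-- Option 1 from the thread: Pig may not realize it can skip a2 and a3,
-- so the whole tuple can end up being deserialized.
C1 = FOREACH B GENERATE COUNT(A);

-- Option 2 from the thread: COUNT only touches a2, so a loader that does
-- column pruning or lazy deserialization can avoid materializing a3.
C2 = FOREACH B GENERATE COUNT(A.a2);

-- Hypothetical variant: project the needed column explicitly before the
-- GROUP, handing Pig the pruning hint even with a simple loader.
A1 = FOREACH A GENERATE a1;
B1 = GROUP A1 by a1;
C3 = FOREACH B1 GENERATE COUNT(A1);

-- Caveat (verify on your Pig version): COUNT ignores tuples whose first
-- field is null, so COUNT(A) skips tuples with null a1 while COUNT(A.a2)
-- skips tuples with null a2; COUNT_STAR counts every tuple regardless.

Whether the early projection actually buys anything depends on the loader;
as Dmitriy and Thejas point out above, the partial counts shipped to the
reducers are tiny either way, so the savings come almost entirely from
skipping deserialization of the unused fields.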