Thanks Thejas!

2010/9/10 Thejas M Nair <te...@yahoo-inc.com>

> Yes, Zebra has a columnar storage format.
> Regarding selective deserialization (i.e. deserializing only the columns
> that are actually needed for the Pig query): as far as I understand,
> elephant-bird has a protocol-buffer-based loader which does lazy
> deserialization. PigStorage also does something similar: when PigStorage is
> used to load data, it returns fields as bytearrays, and Pig adds a
> type-casting FOREACH after the load which converts only the fields that are
> required in the rest of the query.
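>
> A minimal Pig Latin sketch of that behaviour (the file name and schema are
> made up purely for illustration; only the field the script actually touches
> gets converted from bytearray):
>
>     A = LOAD 'input.txt' USING PigStorage('\t') AS (a1, a2, a3);
>     B = FILTER A BY (int) a2 > 10;   -- only a2 is cast from bytearray here
>     C = FOREACH B GENERATE a2;       -- a1 and a3 are never converted
>     DUMP C;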
>
> -Thejas
>
>
>
> On 9/3/10 8:05 PM, "Renato Marroquín Mogrovejo"
> <renatoj.marroq...@gmail.com> wrote:
>
> > Thanks Dmitriy! Hey, a couple of final questions, please.
> > Which deserializers implement this selective deserialization?
> > And is the columnar storage used Zebra?
> > Thanks again for the great replies.
> >
> > Renato M.
> >
> > 2010/9/2 Dmitriy Ryaboy <dvrya...@gmail.com>
> >
> >> Pig has selective deserialization and columnar storage if the loader you
> >> are using implements it. So that depends on what you are doing.
> >> Naturally, if your data is not stored in a way that separates the
> >> columns, Pig can't magically read them separately :).
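> >>
> >> As an illustration, with a columnar loader the projection can be pushed
> >> down into the storage layer. A rough sketch, assuming Zebra's TableLoader
> >> and its column-projection string (the table path is made up):
> >>
> >>     A = LOAD '/path/to/zebra_table'
> >>         USING org.apache.hadoop.zebra.pig.TableLoader('a1, a2');
> >>     B = GROUP A BY a1;
> >>     C = FOREACH B GENERATE group, COUNT(A.a2);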
> >>
> >> You should try to always use combiners.
> >>
> >> -D
> >>
> >>
> >> On Thu, Sep 2, 2010 at 2:51 PM, Renato Marroquín Mogrovejo <
> >> renatoj.marroq...@gmail.com> wrote:
> >>
> >>> So in terms of performance it is the same whether I count just a single
> >>> column or the whole data set, right?
> >>> But is what Thejas said about the loader having optimizations (selective
> >>> deserialization or columnar storage) something that Pig actually has, or
> >>> is it something planned for the future?
> >>> And hey, shouldn't using a combiner be something we try to avoid? I mean,
> >>> for the COUNT case a combiner is needed, but are there any other
> >>> operations that are put into that combiner, like trying to reuse the
> >>> computation being done?
> >>> Thanks for the replies (=
> >>>
> >>> Renato M.
> >>>
> >>>
> >>> 2010/8/29 Mridul Muralidharan <mrid...@yahoo-inc.com>
> >>>
> >>>
> >>>>
> >>>> The reason why COUNT(a.field1) would have better performance is that Pig
> >>>> does not 'know' what is required from the tuple in the case of COUNT(a).
> >>>> In a custom map-reduce job we could optimize this away so that only the
> >>>> single required field is projected out, but that is obviously not
> >>>> possible here (COUNT is a UDF), so the entire tuple is deserialized from
> >>>> the input.
> >>>>
> >>>> Of course, the performance difference, as Dmitriy noted, would not be
> >>>> very large.
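> >>>>
> >>>> If it matters, the projection can also be made explicit in the script
> >>>> itself. This is only a sketch (the relation and column names follow the
> >>>> example further down in the thread):
> >>>>
> >>>>     A  = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3);
> >>>>     A1 = FOREACH A GENERATE a1;   -- drop a2 and a3 before the GROUP
> >>>>     B  = GROUP A1 BY a1;
> >>>>     C  = FOREACH B GENERATE group, COUNT(A1);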
> >>>>
> >>>>
> >>>> Regards,
> >>>> Mridul
> >>>>
> >>>>
> >>>>
> >>>> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
> >>>>
> >>>>> Hi, this is also interesting and kind of confusing for me too (=
> >>>>> Coming from the DB world, the second one would have better performance,
> >>>>> but Pig doesn't keep statistics on the data, so it has to read the
> >>>>> whole file anyway. And since the count operation is mainly done on the
> >>>>> map side, all attributes will be read anyway, but the ones that are not
> >>>>> interesting to us will be discarded and not passed to the reduce part
> >>>>> of the job. Besides, wouldn't the presence of null values affect the
> >>>>> performance? For example, if a2 had many null values, then fewer values
> >>>>> would be passed too, right?
> >>>>>
> >>>>> Renato M.
> >>>>>
> >>>>> 2010/8/27 Mridul Muralidharan<mrid...@yahoo-inc.com>
> >>>>>
> >>>>>
> >>>>>> On second thoughts, that part is obvious - duh
> >>>>>>
> >>>>>> - Mridul
> >>>>>>
> >>>>>>
> >>>>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
> >>>>>>
> >>>>>>
> >>>>>>> But it does for COUNT(A.a2)?
> >>>>>>> That is interesting, and somewhat weird :)
> >>>>>>>
> >>>>>>> Thanks !
> >>>>>>> Mridul
> >>>>>>>
> >>>>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
> >>>>>>>
> >>>>>>>> I think if you do COUNT(A), Pig will not realize it can ignore a2
> >>>>>>>> and a3, and project all of them.
> >>>>>>>>
> >>>>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
> >>>>>>>> <mrid...@yahoo-inc.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>     I am not sure why the second option is better - in both cases,
> >>>>>>>>     you are shipping only the combined counts from map to reduce.
> >>>>>>>>     On the other hand, the first could be better since it means we
> >>>>>>>>     need to project only 'a1' - and none of the other fields.
> >>>>>>>>
> >>>>>>>>     Or did I miss something here?
> >>>>>>>>     I am not very familiar with what Pig does in this case right now.
> >>>>>>>>
> >>>>>>>>     Regards,
> >>>>>>>>     Mridul
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>     On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
> >>>>>>>>
> >>>>>>>>         Generally speaking, the second option will be more
> >>>>>>>>         performant, as it might let you drop column a3 early. In
> >>>>>>>>         most cases the magnitude of this is likely to be very small,
> >>>>>>>>         as COUNT is an algebraic function, so most of the work is
> >>>>>>>>         done map-side anyway, and only partial, pre-aggregated
> >>>>>>>>         counts are shipped from mappers to reducers. However, if A
> >>>>>>>>         is very wide, or a column store, or has non-negligible
> >>>>>>>>         deserialization cost that can be offset by only
> >>>>>>>>         deserializing a few fields -- the second option is better.
> >>>>>>>>
> >>>>>>>>         -D
> >>>>>>>>
> >>>>>>>>         On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes
> >>>>>>>>         <cor...@tynt.com> wrote:
> >>>>>>>>
> >>>>>>>>             Wondering about performance and count...
> >>>>>>>>             A = load 'test.csv' as (a1, a2, a3);
> >>>>>>>>             B = GROUP A by a1;
> >>>>>>>>             -- which is preferred?
> >>>>>>>>             C = FOREACH B GENERATE COUNT(A);
> >>>>>>>>             -- or would this only send a single field through the
> >>>>>>>>             -- COUNT and be more performant?
> >>>>>>>>             C = FOREACH B GENERATE COUNT(A.a2);
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> >
>
>
>
