Ah, I see. Thank you for the explanation & taking the time!! Makes sense.
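In case it is useful to anyone who finds this thread later, this is the variant I will be testing: Cheolsoo's suggestion applied to the Group script quoted further down. A sketch only; $input, $output and $reducers are the same parameters as in my benchmark scripts, and the output path is made up so the run doesn't overwrite the original results.

-- Cheolsoo's suggestion: disable the combiner for this script.
set pig.exec.nocombiner true;

A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa);
B = group A by name PARALLEL $reducers;
-- COUNT is the algebraic UDF that would otherwise be pushed into the combiner.
C = foreach B generate flatten(group), COUNT(A.age);
store C into '$output/dataset_300000000_group_nocombiner' using PigStorage();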
On 22 August 2013 16:38, Alan Gates <[email protected]> wrote:

> When data comes out of a map task, Hadoop serializes it so that it can
> know its exact size as it writes it into the output buffer. To run it
> through the combiner it needs to deserialize it again, and then
> re-serialize it when it comes out. So each pass through the combiner
> costs a serialization/deserialization pass, which is expensive and not
> worth it unless the data reduction is significant.
>
> In other words, the combiner can be slow because Java lacks a sizeof
> operator.
>
> Alan.
>
> On Aug 22, 2013, at 4:01 AM, Benjamin Jakobus wrote:
>
>> Hi Cheolsoo,
>>
>> Thanks - I will try this now and get back to you.
>>
>> Out of interest: could you explain (or point me towards resources that
>> would) why the combiner would be a problem?
>>
>> Also, could the fact that Pig builds an intermediary data structure (?)
>> whilst Hive just performs a sort and then the arithmetic operation
>> explain the slowdown?
>>
>> (Apologies, I'm quite new to Pig/Hive - these are just my guesses.)
>>
>> Regards,
>> Benjamin
>>
>> On 22 August 2013 01:07, Cheolsoo Park <[email protected]> wrote:
>>
>>> Hi Benjamin,
>>>
>>> Thank you very much for sharing detailed information!
>>>
>>> 1) From the runtime numbers that you provided, the mappers are very
>>> slow:
>>>
>>> CPU time spent (ms):  map 5,081,610  reduce 168,740  total 5,250,350
>>> CPU time spent (ms):  map 5,052,700  reduce 178,220  total 5,230,920
>>> CPU time spent (ms):  map 5,084,430  reduce 193,480  total 5,277,910
>>>
>>> 2) In your GROUP BY query, you have an algebraic UDF, "COUNT".
>>>
>>> I am wondering whether disabling the combiner will help here. I have
>>> seen a lot of cases where the combiner actually hurt performance
>>> significantly because it didn't reduce the mapper output by much.
>>> Briefly looking at generate_data.pl in PIG-200, it looks like a lot of
>>> random keys are generated, so I guess you will end up with a large
>>> number of small bags rather than a small number of large bags. If
>>> that's the case, the combiner will only add overhead to the mappers.
>>>
>>> Can you try to include "set pig.exec.nocombiner true;" and see whether
>>> it helps?
>>>
>>> Thanks,
>>> Cheolsoo
>>>
>>> On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus
>>> <[email protected]> wrote:
>>>
>>>> Hi Cheolsoo,
>>>>
>>>>> What's your query like? Can you share it? Do you call any algebraic
>>>>> UDF after group by? I am wondering whether combiner matters in your
>>>>> test.
>>>>
>>>> I have been running three different types of queries.
>>>>
>>>> The first was performed on datasets of six different sizes:
>>>>
>>>> - Dataset size 1: 30,000 records (772KB)
>>>> - Dataset size 2: 300,000 records (6.4MB)
>>>> - Dataset size 3: 3,000,000 records (63MB)
>>>> - Dataset size 4: 30 million records (628MB)
>>>> - Dataset size 5: 300 million records (6.2GB)
>>>> - Dataset size 6: 3 billion records (62GB)
>>>>
>>>> The datasets scale linearly, whereby the size of dataset n equates to
>>>> 3000 * 10^n records. A seventh dataset consisting of 1,000 records
>>>> (23KB) was produced to perform join operations on. Its schema is as
>>>> follows:
>>>>
>>>> name  - string
>>>> marks - integer
>>>> gpa   - float
>>>>
>>>> The datasets were generated using the generate_data.pl Perl script
>>>> available for download from
>>>> https://issues.apache.org/jira/browse/PIG-200.
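>>>> (As an aside for anyone reproducing this: the benchmark scripts below
>>>> load every field untyped. A minimal typed load matching the schema
>>>> above would look like the sketch below; the alias J is purely
>>>> illustrative, and note that the join scripts themselves name the
>>>> second field "age" rather than "marks":
>>>>
>>>> J = load '$input/dataset_join' using PigStorage('\t')
>>>>         as (name:chararray, marks:int, gpa:float);
>>>> -- print the declared schema as a sanity check
>>>> describe J;
>>>>
>>>> This is only a sketch and is not part of the benchmark.)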
>>>> The results are as follows (averaged real-time runtimes, in seconds):
>>>>
>>>>              Set 1    Set 2    Set 3     Set 4     Set 5      Set 6
>>>> Arithmetic   32.82    36.21    49.49     83.25    423.63    3900.78
>>>> Filter 10%   32.94    34.32    44.56     66.68    295.59    2640.52
>>>> Filter 90%   33.93    32.55    37.86     53.22    197.36    1657.37
>>>> Group        49.43    53.34    69.84    105.12    497.61    4394.21
>>>> Join         49.89    50.08    78.55    150.39   1045.34   10258.19
>>>>
>>>> Averaged performance of the arithmetic, join, group and filter
>>>> operations on the six datasets using Pig. Scripts were configured to
>>>> use 8 reduce and 11 map tasks.
>>>>
>>>>              Set 1    Set 2    Set 3     Set 4     Set 5      Set 6
>>>> Arithmetic   32.84    37.33    72.55    300.08   2633.72   27821.19
>>>> Filter 10%   32.36    53.28    59.22    209.5    1672.3    18222.19
>>>> Filter 90%   31.23    32.68    36.8      69.55    331.88    3320.59
>>>> Group        48.27    47.68    46.87     53.66    141.36    1233.4
>>>> Join         48.54    56.86   104.6     517.5    4388.34       -
>>>> Distinct     48.73    53.28    72.54    109.77      -          -
>>>>
>>>> Averaged performance of the arithmetic, join, group, distinct select
>>>> and filter operations on the six datasets using Hive. Scripts were
>>>> configured to use 8 reduce and 11 map tasks.
>>>>
>>>> (If you want to see the standard deviations, let me know.)
>>>>
>>>> So, to summarize the results: Pig outperforms Hive, with the
>>>> exception of Group By.
>>>>
>>>> The Pig scripts used for this benchmark are as follows:
>>>>
>>>> Arithmetic:
>>>>
>>>> -- Generate with basic arithmetic
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = foreach A generate age * gpa + 3, age / gpa - 1.5 PARALLEL $reducers;
>>>> store B into '$output/dataset_300000000_projection' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Filter 10%:
>>>>
>>>> -- Filter that removes 10% of the data
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = filter A by gpa < '3.6' PARALLEL $reducers;
>>>> store B into '$output/dataset_300000000_filter_10' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Filter 90%:
>>>>
>>>> -- Filter that removes 90% of the data
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = filter A by age < '25' PARALLEL $reducers;
>>>> store B into '$output/dataset_300000000_filter_90' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Group:
>>>>
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = group A by name PARALLEL $reducers;
>>>> C = foreach B generate flatten(group), COUNT(A.age) PARALLEL $reducers;
>>>> store C into '$output/dataset_300000000_group' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Join:
>>>>
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> C = cogroup A by name inner, B by name inner PARALLEL $reducers;
>>>> D = foreach C generate flatten(A), flatten(B) PARALLEL $reducers;
>>>> store D into '$output/dataset_300000000_cogroup_big' using PigStorage() PARALLEL $reducers;
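>>>> (One variant I may also try: since dataset_join holds only 1,000
>>>> records, the cogroup above could be replaced by Pig's
>>>> fragment-replicated join, which ships the small relation to every
>>>> mapper and avoids the reduce phase of the join entirely. A sketch
>>>> only, not one of the benchmark scripts, and the output path is made
>>>> up:
>>>>
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa);
>>>> B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa);
>>>> -- 'replicated' loads the last-listed relation (B) into memory on each mapper
>>>> C = join A by name, B by name using 'replicated';
>>>> store C into '$output/dataset_300000000_repl_join' using PigStorage();
>>>> )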
>>>> Similarly, here are the Hive scripts:
>>>>
>>>> Arithmetic:
>>>>
>>>> SELECT (dataset.age * dataset.gpa + 3) AS F1,
>>>>        (dataset.age / dataset.gpa - 1.5) AS F2
>>>> FROM dataset
>>>> WHERE dataset.gpa > 0;
>>>>
>>>> Filter 10%:
>>>>
>>>> SELECT *
>>>> FROM dataset
>>>> WHERE dataset.gpa < 3.6;
>>>>
>>>> Filter 90%:
>>>>
>>>> SELECT *
>>>> FROM dataset
>>>> WHERE dataset.age < 25;
>>>>
>>>> Group:
>>>>
>>>> SELECT COUNT(dataset.age)
>>>> FROM dataset
>>>> GROUP BY dataset.name;
>>>>
>>>> Join:
>>>>
>>>> SELECT *
>>>> FROM dataset JOIN dataset_join
>>>> ON dataset.name = dataset_join.name;
>>>>
>>>> I will re-run the benchmarks to see whether it is the reduce or the
>>>> map side that is slower and get back to you later today.
>>>>
>>>> The other two benchmarks were slightly different: I performed two
>>>> transitive self-joins, in which Pig outperformed Hive. However, once
>>>> I added a Group By, Hive began outperforming Pig.
>>>>
>>>> I also ran the TPC-H benchmarks and noticed that Hive (surprisingly)
>>>> outperformed Pig. However, what *seems* to cause the actual
>>>> performance difference is the heavy usage of the Group By operator
>>>> in all but 3 of the TPC-H test scripts.
>>>>
>>>> Re-running the scripts whilst omitting the grouping of data produces
>>>> the expected results. For example, running script 3
>>>> (q3_shipping_priority.pig) whilst omitting the Group By operator
>>>> significantly reduces the runtime (to 1278.49 seconds of real-time
>>>> runtime, or a total of 12,257,630 ms of CPU time).
>>>>
>>>> The fact that the Group By operator skews the TPC-H benchmark in
>>>> favour of Apache Hive is supported by further experiments: as noted
>>>> above, two transitive self-join benchmarks were carried out. The
>>>> first took Pig an average of 45.36 seconds (real-time runtime) to
>>>> execute, whereas it took Hive 56.73 seconds; the second took Pig
>>>> 157.97 seconds and Hive 180.19 seconds (again, on average). However,
>>>> adding the Group By operator to the scripts turned the tides: Pig
>>>> became significantly slower than Hive, requiring an average of
>>>> 278.15 seconds, whereas Hive required only 204.01 seconds to perform
>>>> the JOIN and GROUP operations.
>>>>
>>>> Real-time runtimes are measured using the time -p command.
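>>>> (For clarity, a hypothetical sketch of what "adding the Group By
>>>> operator" to a self-join script looks like; this is a reconstruction,
>>>> not the exact benchmark script, and the input/output paths are made
>>>> up:
>>>>
>>>> A = load '$input/dataset' using PigStorage('\t') as (name, age, gpa);
>>>> B = load '$input/dataset' using PigStorage('\t') as (name, age, gpa);
>>>> -- self-join on name, then group the joined relation and count
>>>> J = join A by name, B by name PARALLEL $reducers;
>>>> G = group J by A::name PARALLEL $reducers;
>>>> C = foreach G generate flatten(group), COUNT(J);
>>>> store C into '$output/selfjoin_group' using PigStorage();
>>>> )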
>>>> Best Regards,
>>>> Benjamin
>>>>
>>>> On 20 August 2013 19:56, Cheolsoo Park <[email protected]> wrote:
>>>>
>>>>> Hi Benjamin,
>>>>>
>>>>> Can you describe which step of the group by is slow, the mapper side
>>>>> or the reducer side?
>>>>>
>>>>> What's your query like? Can you share it? Do you call any algebraic
>>>>> UDF after group by? I am wondering whether combiner matters in your
>>>>> test.
>>>>>
>>>>> Thanks,
>>>>> Cheolsoo
>>>>>
>>>>> On Tue, Aug 20, 2013 at 2:27 AM, Benjamin Jakobus
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> After benchmarking Hive and Pig, I found that the Group By operator
>>>>>> in Pig is drastically slower than Hive's. I was wondering whether
>>>>>> anybody has experienced the same, and whether people may have any
>>>>>> tips for improving the performance of this operation? (Adding a
>>>>>> DISTINCT, as suggested by an earlier post on here, doesn't help. I
>>>>>> am currently re-running the benchmark with LZO compression
>>>>>> enabled.)
>>>>>>
>>>>>> Regards,
>>>>>> Ben
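>>>>>>
>>>>>> P.S. By "LZO compression enabled" I mean compressing Pig's
>>>>>> intermediate (temporary) files between MapReduce jobs. If I
>>>>>> understand the settings correctly, these are the two properties I
>>>>>> will be toggling (assuming an LZO codec is installed on the
>>>>>> cluster); they can also go in pig.properties:
>>>>>>
>>>>>> -- compress Pig's temp files between consecutive MR jobs
>>>>>> set pig.tmpfilecompression true;
>>>>>> -- use LZO as the codec (gz is the other supported value)
>>>>>> set pig.tmpfilecompression.codec lzo;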
