Re: Slow Group By operator

Cheolsoo Park Sun, 25 Aug 2013 12:32:41 -0700

I have no more suggestions. If you find anything, please share with us. I
would be interested in understanding what you're seeing.


On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus
<jakobusbe...@gmail.com>wrote:

> "combiner + mapPartAgg set to true" - yup!
>
>
> On 25 August 2013 18:57, Cheolsoo Park <piaozhe...@gmail.com> wrote:
>
> > I guess you mean "combiner + mapPartAgg set to true" not "no combiner +
> > mapPartAgg set to true".
> >
> >
> > On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus
> > <jakobusbe...@gmail.com>wrote:
> >
> > > Hi Cheolsoo,
> > >
> > > Just ran the benchmarks: no luck.
> > >
> > > No combiner + mapPartAgg set to true is slower than without the
> combiner:
> > > real 752.85
> > > real 757.41
> > > real 749.03
> > >
> > >
> > >
> > > On 25 August 2013 17:11, Benjamin Jakobus <jakobusbe...@gmail.com>
> > wrote:
> > >
> > > > Hi Cheolsoo,
> > > >
> > > > Thanks - let's see, I'll give it a try now.
> > > >
> > > > Best Regards,
> > > > Ben
> > > >
> > > >
> > > > On 25 August 2013 02:27, Cheolsoo Park <piaozhe...@gmail.com> wrote:
> > > >
> > > >> Hi Benjamin,
> > > >>
> > > > >> Thanks for letting us know. That means my original assumption was
> > > > >> wrong. The size of bags is not small. In fact, you can compute the
> > > > >> avg size of bags as follows: total number of input records /
> > > > >> ( reduce input groups x number of reducers ).
> > > > >>
> > > > >> One more thing you can try is turning on "pig.exec.mapPartAgg". That
> > > > >> may help mappers run faster. If this doesn't work, I'm out of ideas.
> > > > >> :-)
> > > >>
> > > >> Thanks,
> > > >> Cheolsoo
> > > >>
> > > >>
> > > >>
> > > >> On Sat, Aug 24, 2013 at 3:27 AM, Benjamin Jakobus <
> > > jakobusbe...@gmail.com
> > > >> >wrote:
> > > >>
> > > >> > Hi Alan, Cheolsoo,
> > > >> >
> > > >> > I re-ran the benchmarks with and without the combiner. Enabling
> the
> > > >> > combiner is faster:
> > > >> >
> > > >> > With combiner:
> > > >> > real 668.44
> > > >> > real 663.10
> > > >> > real 665.05
> > > >> >
> > > >> > Without combiner:
> > > >> > real 795.97
> > > >> > real 810.51
> > > >> > real 810.16
> > > >> >
> > > >> > Best Regards,
> > > >> > Ben
> > > >> >
> > > >> >
> > > >> > On 22 August 2013 16:33, Cheolsoo Park <piaozhe...@gmail.com>
> > wrote:
> > > >> >
> > > >> > > Hi Benjamin,
> > > >> > >
> > > >> > > To answer your question, how the Hadoop combiner works is that
> 1)
> > > >> mappers
> > > >> > > write outputs to disk and 2) combiners read them, combine and
> > write
> > > >> them
> > > >> > > again. So you're paying extra disk I/O as well as
> > > >> > > serialization/deserialization.
> > > >> > >
> > > >> > > This will pay off if combiners significantly reduce the
> > intermediate
> > > >> > > outputs that reducers need to fetch from mappers. But if
> reduction
> > > is
> > > >> not
> > > >> > > significant, it will only slow down mappers. You can identify
> > > whether
> > > >> > this
> > > >> > > is really a problem by comparing the time spent by map and
> combine
> > > >> > > functions in the task logs.
> > > >> > >
> > > > >> > > What I usually do is:
> > > > >> > > 1) If there are many small bags, disable combiners.
> > > > >> > > 2) If there are many large bags, enable combiners. Furthermore,
> > > > >> > > turning on "pig.exec.mapPartAgg" helps. (See the Pig blog
> > > > >> > > <https://blogs.apache.org/pig/entry/apache_pig_it_goes_to> for
> > > > >> > > details.)
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Cheolsoo
> > > >> > >
> > > >> > >
> > > >> > > On Thu, Aug 22, 2013 at 4:01 AM, Benjamin Jakobus <
> > > >> > jakobusbe...@gmail.com
> > > >> > > >wrote:
> > > >> > >
> > > >> > > > Hi Cheolsoo,
> > > >> > > >
> > > >> > > > Thanks - I will try this now and get back to you.
> > > >> > > >
> > > >> > > > Out of interest; could you explain (or point me towards
> > resources
> > > >> that
> > > >> > > > would) why the combiner would be a problem?
> > > >> > > >
> > > >> > > > Also, could the fact that Pig builds an intermediary data
> > > structure
> > > >> (?)
> > > >> > > > whilst Hive just performs a sort then the arithmetic operation
> > > >> explain
> > > >> > > the
> > > >> > > > slowdown?
> > > >> > > >
> > > >> > > > (Apologies, I'm quite new to Pig/Hive - just my guesses).
> > > >> > > >
> > > >> > > > Regards,
> > > >> > > > Benjamin
> > > >> > > >
> > > >> > > >
> > > >> > > > On 22 August 2013 01:07, Cheolsoo Park <piaozhe...@gmail.com>
> > > >> wrote:
> > > >> > > >
> > > >> > > > > Hi Benjamin,
> > > >> > > > >
> > > >> > > > > Thank you very much for sharing detailed information!
> > > >> > > > >
> > > >> > > > > 1) From the runtime numbers that you provided, the mappers
> are
> > > >> very
> > > >> > > slow.
> > > >> > > > >
> > > > >> > > > > Counter                    Map    Reduce      Total
> > > > >> > > > > CPU time spent (ms)  5,081,610   168,740  5,250,350
> > > > >> > > > > CPU time spent (ms)  5,052,700   178,220  5,230,920
> > > > >> > > > > CPU time spent (ms)  5,084,430   193,480  5,277,910
> > > >> > > > >
> > > >> > > > > 2) In your GROUP BY query, you have an algebraic UDF
> "COUNT".
> > > >> > > > >
> > > >> > > > > I am wondering whether disabling combiner will help here. I
> > have
> > > >> > seen a
> > > >> > > > lot
> > > >> > > > > of cases where combiner actually hurt performance
> > significantly
> > > >> if it
> > > >> > > > > doesn't combine mapper outputs significantly. Briefly
> looking
> > at
> > > >> > > > > generate_data.pl in PIG-200, it looks like a lot of random
> > keys
> > > >> are
> > > >> > > > > generated. So I guess you will end up with a large number of
> > > small
> > > >> > bags
> > > >> > > > > rather than a small number of large bags. If that's the
> case,
> > > >> > combiner
> > > >> > > > will
> > > >> > > > > only add overhead to mappers.
> > > >> > > > >
> > > >> > > > > Can you try to include this "set pig.exec.nocombiner true;"
> > and
> > > >> see
> > > >> > > > whether
> > > >> > > > > it helps?
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > > Cheolsoo
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus <
> > > >> > > > jakobusbe...@gmail.com
> > > >> > > > > >wrote:
> > > >> > > > >
> > > >> > > > > > Hi Cheolsoo,
> > > >> > > > > >
> > > >> > > > > > >>What's your query like? Can you share it? Do you call
> any
> > > >> > algebraic
> > > >> > > > UDF
> > > >> > > > > > >> after group by? I am wondering whether combiner matters
> > in
> > > >> your
> > > >> > > > test.
> > > >> > > > > > I have been running 3 different types of queries.
> > > >> > > > > >
> > > >> > > > > > The first was performed on datasets of 6 different sizes:
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >    - Dataset size 1: 30,000 records (772KB)
> > > >> > > > > >    - Dataset size 2: 300,000 records (6.4MB)
> > > >> > > > > >    - Dataset size 3: 3,000,000 records (63MB)
> > > >> > > > > >    - Dataset size 4: 30 million records (628MB)
> > > >> > > > > >    - Dataset size 5: 300 million records (6.2GB)
> > > >> > > > > >    - Dataset size 6: 3 billion records (62GB)
> > > >> > > > > >
> > > > >> > > > > > The datasets scale linearly, whereby the size of dataset n
> > > > >> > > > > > equates to 3000 * 10^n records.
> > > >> > > > > > A seventh dataset consisting of 1,000 records (23KB) was
> > > >> produced
> > > >> > to
> > > >> > > > > > perform join
> > > >> > > > > > operations on. Its schema is as follows:
> > > >> > > > > > name - string
> > > >> > > > > > marks - integer
> > > >> > > > > > gpa - float
> > > > >> > > > > > The data was generated using the generate_data.pl Perl script,
> > > > >> > > > > > available for download from
> > > > >> > > > > > https://issues.apache.org/jira/browse/PIG-200, to produce the
> > > > >> > > > > > datasets. The results are as follows:
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >  *      * *      * *      * *Set 1      * *Set 2**      *
> > *Set
> > > >> 3**
> > > >> > > >  *
> > > >> > > > > > *Set
> > > >> > > > > > 4**      * *Set 5**      * *Set 6*
> > > >> > > > > > *Arithmetic**      * 32.82*      * 36.21*      * 49.49*
> >  *
> > > >> > 83.25*
> > > >> > > > > >  *
> > > >> > > > > >  423.63*      * 3900.78
> > > >> > > > > > *Filter 10%**      * 32.94*      * 34.32*      * 44.56*
> >  *
> > > >> > 66.68*
> > > >> > > > > >  *
> > > >> > > > > >  295.59*      * 2640.52
> > > >> > > > > > *Filter 90%**      * 33.93*      * 32.55*      * 37.86*
> >  *
> > > >> > 53.22*
> > > >> > > > > >  *
> > > >> > > > > >  197.36*      * 1657.37
> > > >> > > > > > *Group**      * *      *49.43*      * 53.34*      * 69.84*
> > > >>  *
> > > >> > > > 105.12*
> > > >> > > > > >    *497.61*      * 4394.21
> > > >> > > > > > *Join**      * *      *   49.89*      * 50.08*      *
> 78.55*
> > > >>    *
> > > >> > > > > 150.39*
> > > >> > > > > >    *1045.34*     *10258.19
> > > >> > > > > > *Averaged performance of arithmetic, join, group, order,
> > > >> distinct
> > > >> > > > select
> > > >> > > > > > and filter operations on six datasets using Pig. Scripts
> > were
> > > >> > > > configured
> > > >> > > > > as
> > > >> > > > > > to use 8 reduce and 11 map tasks.*
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >  *      * *              Set 1**      * *Set 2**      *
> *Set
> > > 3**
> > > >> > >  *
> > > >> > > > > > *Set
> > > >> > > > > > 4**      * *Set 5**      * *Set 6*
> > > >> > > > > > *Arithmetic**      *  32.84*      * 37.33*      * 72.55*
> > >  *
> > > >> > > 300.08
> > > >> > > > > >  2633.72    27821.19
> > > >> > > > > > *Filter 10%      *   32.36*      * 53.28*      * 59.22*
> >  *
> > > >> > 209.5*
> > > >> > > > >  *
> > > >> > > > > > 1672.3*     *18222.19
> > > >> > > > > > *Filter 90%      *  31.23*      * 32.68*      *  36.8*
> >  *
> > > >> >  69.55*
> > > >> > > > > >  *
> > > >> > > > > > 331.88*     *3320.59
> > > >> > > > > > *Group      * *      * 48.27*      * 47.68*      * 46.87*
> > >  *
> > > >> > > 53.66*
> > > >> > > > > >  *141.36*     *1233.4
> > > >> > > > > > *Join      * *      * *   *48.54*      *56.86*      *
> 104.6*
> > > >>    *
> > > >> > > > > 517.5*
> > > >> > > > > >    * 4388.34*      * -
> > > >> > > > > > *Distinct**      * *     *48.73*      *53.28*      *
> 72.54*
> > > >>  *
> > > >> > > > > 109.77*
> > > >> > > > > >    * - *      * *      *  -
> > > >> > > > > > *Averaged performance of arithmetic, join, group, distinct
> > > >> select
> > > >> > and
> > > >> > > > > > filter operations on six datasets using Hive. Scripts were
> > > >> > configured
> > > >> > > > as
> > > >> > > > > to
> > > >> > > > > > use 8 reduce and 11 map tasks.*
> > > >> > > > > >
> > > >> > > > > > (If you want to see the standard deviation, let me know).
> > > >> > > > > >
> > > > >> > > > > > So, to summarize the results: Pig outperforms Hive, with the
> > > > >> > > > > > exception of using *Group By*.
> > > >> > > > > >
> > > >> > > > > > The Pig scripts used for this benchmark are as follows:
> > > > >> > > > > > *Arithmetic*
> > > > >> > > > > > -- Generate with basic arithmetic
> > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t')
> > > > >> > > > > >     as (name, age, gpa) PARALLEL $reducers;
> > > > >> > > > > > B = foreach A generate age * gpa + 3, age/gpa - 1.5
> > > > >> > > > > >     PARALLEL $reducers;
> > > > >> > > > > > store B into '$output/dataset_300000000_projection'
> > > > >> > > > > >     using PigStorage() PARALLEL $reducers;
> > > > >> > > > > >
> > > > >> > > > > > *Filter 10%*
> > > > >> > > > > > -- Filter that removes 10% of data
> > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t')
> > > > >> > > > > >     as (name, age, gpa) PARALLEL $reducers;
> > > > >> > > > > > B = filter A by gpa < '3.6' PARALLEL $reducers;
> > > > >> > > > > > store B into '$output/dataset_300000000_filter_10'
> > > > >> > > > > >     using PigStorage() PARALLEL $reducers;
> > > >> > > > > >
> > > >> > > > > >
> > > > >> > > > > > *Filter 90%*
> > > > >> > > > > > -- Filter that removes 90% of data
> > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t')
> > > > >> > > > > >     as (name, age, gpa) PARALLEL $reducers;
> > > > >> > > > > > B = filter A by age < '25' PARALLEL $reducers;
> > > > >> > > > > > store B into '$output/dataset_300000000_filter_90'
> > > > >> > > > > >     using PigStorage() PARALLEL $reducers;
> > > > >> > > > > >
> > > > >> > > > > > *Group*
> > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t')
> > > > >> > > > > >     as (name, age, gpa) PARALLEL $reducers;
> > > > >> > > > > > B = group A by name PARALLEL $reducers;
> > > > >> > > > > > C = foreach B generate flatten(group), COUNT(A.age)
> > > > >> > > > > >     PARALLEL $reducers;
> > > > >> > > > > > store C into '$output/dataset_300000000_group'
> > > > >> > > > > >     using PigStorage() PARALLEL $reducers;
> > > > >> > > > > >
> > > > >> > > > > > *Join*
> > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t')
> > > > >> > > > > >     as (name, age, gpa) PARALLEL $reducers;
> > > > >> > > > > > B = load '$input/dataset_join' using PigStorage('\t')
> > > > >> > > > > >     as (name, age, gpa) PARALLEL $reducers;
> > > > >> > > > > > C = cogroup A by name inner, B by name inner
> > > > >> > > > > >     PARALLEL $reducers;
> > > > >> > > > > > D = foreach C generate flatten(A), flatten(B)
> > > > >> > > > > >     PARALLEL $reducers;
> > > > >> > > > > > store D into '$output/dataset_300000000_cogroup_big'
> > > > >> > > > > >     using PigStorage() PARALLEL $reducers;
> > > >> > > > > >
> > > > >> > > > > > Similarly, here are the Hive scripts:
> > > > >> > > > > > *Arithmetic*
> > > > >> > > > > > SELECT (dataset.age * dataset.gpa + 3) AS F1,
> > > > >> > > > > >        (dataset.age/dataset.gpa - 1.5) AS F2
> > > >> > > > > > FROM dataset
> > > >> > > > > > WHERE dataset.gpa > 0;
> > > >> > > > > >
> > > >> > > > > > *Filter 10%*
> > > >> > > > > > SELECT *
> > > >> > > > > > FROM dataset
> > > >> > > > > > WHERE dataset.gpa < 3.6;
> > > >> > > > > >
> > > >> > > > > > *Filter 90%*
> > > >> > > > > > SELECT *
> > > >> > > > > > FROM dataset
> > > >> > > > > > WHERE dataset.age < 25;
> > > >> > > > > >
> > > >> > > > > > *Group*
> > > >> > > > > > SELECT COUNT(dataset.age)
> > > >> > > > > > FROM dataset
> > > >> > > > > > GROUP BY dataset.name;
> > > >> > > > > >
> > > >> > > > > > *Join*
> > > >> > > > > > SELECT *
> > > >> > > > > > FROM dataset JOIN dataset_join
> > > >> > > > > > ON dataset.name = dataset_join.name;
> > > >> > > > > >
> > > >> > > > > > I will re-run the benchmarks to see whether it is the
> reduce
> > > or
> > > >> map
> > > >> > > > side
> > > >> > > > > > that is slower and get back to you later today.
> > > >> > > > > >
> > > >> > > > > > The other two benchmarks were slightly different: I
> > performed
> > > >> > > > transitive
> > > >> > > > > > self joins in which Pig outperformed Hive. However once I
> > > added
> > > >> a
> > > >> > > Group
> > > >> > > > > By,
> > > >> > > > > > Hive began outperforming Pig.
> > > >> > > > > >
> > > >> > > > > > I also ran the TPC-H benchmarks and noticed that Hive
> > > >> > (surprisingly)
> > > >> > > > > > outperformed Pig. However what *seems* to cause the actual
> > > >> > > performance
> > > >> > > > > > difference is the heavy usage of the Group By operator in
> > all
> > > >> but 3
> > > >> > > > TPC-H
> > > >> > > > > > test scripts.
> > > >> > > > > >
> > > > >> > > > > > Re-running the scripts whilst omitting the grouping of data
> > > > >> > > > > > produces the expected results. For example, running script 3
> > > > >> > > > > > (q3_shipping_priority.pig) whilst omitting the Group By operator
> > > > >> > > > > > significantly reduces the runtime (to 1278.49 seconds real time
> > > > >> > > > > > runtime or a total of 12,257,630ms CPU time).
> > > >> > > > > >
> > > >> > > > > > The fact that the Group By operator skews the TPC-H
> > benchmark
> > > in
> > > >> > > favour
> > > >> > > > > of
> > > >> > > > > > Apache Hive is supported by further experiments: as noted
> > > >> earlier a
> > > >> > > > > > benchmark was carried out on a transitive self-join. The
> > > former
> > > >> > took
> > > >> > > > Pig
> > > >> > > > > an
> > > >> > > > > > average of 45.36 seconds (real time runtime) to execute;
> it
> > > took
> > > >> > Hive
> > > >> > > > > 56.73
> > > >> > > > > > seconds. The latter took  Pig 157.97 and Hive 180.19
> seconds
> > > >> > (again,
> > > >> > > on
> > > >> > > > > > average). However adding the Group By operator to the
> > scripts
> > > >> > turned
> > > >> > > > the
> > > >> > > > > > tides: Pig is now significantly slower than Hive,
> requiring
> > an
> > > >> > > average
> > > >> > > > of
> > > >> > > > > > 278.15 seconds. Hive on the other hand required only
> 204.01
> > to
> > > >> > > perform
> > > >> > > > > the
> > > >> > > > > > JOIN and GROUP operations.
> > > >> > > > > >
> > > >> > > > > > Real time runtime is measured using the time -p command.
> > > >> > > > > >
> > > >> > > > > > Best Regards,
> > > >> > > > > > Benjamin
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On 20 August 2013 19:56, Cheolsoo Park <
> > piaozhe...@gmail.com>
> > > >> > wrote:
> > > >> > > > > >
> > > > >> > > > > > > Hi Benjamin,
> > > >> > > > > > >
> > > >> > > > > > > Can you describe which step of group by is slow? Mapper
> > side
> > > >> or
> > > >> > > > reducer
> > > >> > > > > > > side?
> > > >> > > > > > >
> > > >> > > > > > > What's your query like? Can you share it? Do you call
> any
> > > >> > algebraic
> > > >> > > > UDF
> > > >> > > > > > > after group by? I am wondering whether combiner matters
> in
> > > >> your
> > > >> > > test.
> > > >> > > > > > >
> > > >> > > > > > > Thanks,
> > > >> > > > > > > Cheolsoo
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > On Tue, Aug 20, 2013 at 2:27 AM, Benjamin Jakobus <
> > > >> > > > > > jakobusbe...@gmail.com
> > > >> > > > > > > >wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Hi all,
> > > >> > > > > > > >
> > > > >> > > > > > > > After benchmarking Hive and Pig, I found that the Group By
> > > > >> > > > > > > > operator in Pig is drastically slower than Hive's. I was
> > > > >> > > > > > > > wondering whether anybody has experienced the same? And
> > > > >> > > > > > > > whether people may have any tips for improving the
> > > > >> > > > > > > > performance of this operation? (Adding a DISTINCT as
> > > > >> > > > > > > > suggested by an earlier post on here doesn't help. I am
> > > > >> > > > > > > > currently re-running the benchmark with LZO compression
> > > > >> > > > > > > > enabled).
> > > >> > > > > > > >
> > > >> > > > > > > > Regards,
> > > >> > > > > > > > Ben
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
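
For readers comparing the three Group By configurations reported in this thread, the following is a minimal Python sketch that averages the quoted "real" runtimes and expresses each configuration relative to the combiner-only runs. The runtimes are copied from the messages above; the names RUNS, mean_runtimes and relative_to are hypothetical helpers introduced only for this illustration and are not part of Pig or Hive.

```python
from statistics import mean

# "real" wall-clock runtimes in seconds (from `time -p`), copied from the thread.
RUNS = {
    "combiner": [668.44, 663.10, 665.05],
    "no combiner": [795.97, 810.51, 810.16],
    "combiner + mapPartAgg": [752.85, 757.41, 749.03],
}


def mean_runtimes(runs):
    """Return the mean runtime per configuration, rounded to 2 decimals."""
    return {name: round(mean(times), 2) for name, times in runs.items()}


def relative_to(baseline, runs):
    """Return each configuration's mean runtime divided by the baseline's mean."""
    base = mean(runs[baseline])
    return {name: round(mean(times) / base, 3) for name, times in runs.items()}


if __name__ == "__main__":
    means = mean_runtimes(RUNS)
    ratios = relative_to("combiner", RUNS)
    for name in RUNS:
        print(f"{name:<24} mean={means[name]:>8.2f}s  vs combiner={ratios[name]:.3f}x")
```

On these numbers the combiner-only runs are the fastest of the three configurations, which matches the conclusion reached in the thread: disabling the combiner and enabling pig.exec.mapPartAgg both made this particular query slower.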
