I have no more suggestion. If you find anything, please share with us. I would be interested in understanding what you're seeing.
On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus <jakobusbe...@gmail.com>wrote: > "combiner + mapPartAgg set to true" - yup! > > > On 25 August 2013 18:57, Cheolsoo Park <piaozhe...@gmail.com> wrote: > > > I guess you mean "combiner + mapPartAgg set to true" not "no combiner + > > mapPartAgg set to true". > > > > > > On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus > > <jakobusbe...@gmail.com>wrote: > > > > > Hi Cheolsoo, > > > > > > Just ran the benchmarks: no luck. > > > > > > No combiner + mapPartAgg set to true is slower than without the > combiner: > > > real 752.85 > > > real 757.41 > > > real 749.03 > > > > > > > > > > > > On 25 August 2013 17:11, Benjamin Jakobus <jakobusbe...@gmail.com> > > wrote: > > > > > > > Hi Cheolsoo, > > > > > > > > Thanks - let's see, I'll give it a try now. > > > > > > > > Best Regards, > > > > Ben > > > > > > > > > > > > On 25 August 2013 02:27, Cheolsoo Park <piaozhe...@gmail.com> wrote: > > > > > > > >> Hi Benjamin, > > > >> > > > >> Thanks for letting us know. That means my original assumption was > > wrong. > > > >> The size of bags is not small. In fact, you can compute the avg size > > of > > > >> bags as follows: total number of input records / ( reduce input > > groups x > > > >> number of reducers ). > > > >> > > > >> One more thing you can try is turning on "pig.exec.mapPartAgg". That > > may > > > >> help mappers run faster. If this doesn't work, I run out of ideas. > :-) > > > >> > > > >> Thanks, > > > >> Cheolsoo > > > >> > > > >> > > > >> > > > >> On Sat, Aug 24, 2013 at 3:27 AM, Benjamin Jakobus < > > > jakobusbe...@gmail.com > > > >> >wrote: > > > >> > > > >> > Hi Alan, Cheolsoo, > > > >> > > > > >> > I re-ran the benchmarks with and without the combiner. Enabling > the > > > >> > combiner is faster: > > > >> > > > > >> > With combiner: > > > >> > real 668.44 > > > >> > real 663.10 > > > >> > real 665.05 > > > >> > > > > >> > Without combiner: > > > >> > real 795.97 > > > >> > real 810.51 > > > >> > real 810.16 > > > >> > > > > >> > Best Regards, > > > >> > Ben > > > >> > > > > >> > > > > >> > On 22 August 2013 16:33, Cheolsoo Park <piaozhe...@gmail.com> > > wrote: > > > >> > > > > >> > > Hi Benjamin, > > > >> > > > > > >> > > To answer your question, how the Hadoop combiner works is that > 1) > > > >> mappers > > > >> > > write outputs to disk and 2) combiners read them, combine and > > write > > > >> them > > > >> > > again. So you're paying extra disk I/O as well as > > > >> > > serialization/deserialization. > > > >> > > > > > >> > > This will pay off if combiners significantly reduce the > > intermediate > > > >> > > outputs that reducers need to fetch from mappers. But if > reduction > > > is > > > >> not > > > >> > > significant, it will only slow down mappers. You can identify > > > whether > > > >> > this > > > >> > > is really a problem by comparing the time spent by map and > combine > > > >> > > functions in the task logs. > > > >> > > > > > >> > > What I usually do are: > > > >> > > 1) If there are many small bags, disable combiners. > > > >> > > 2) If there are many large bags, enable combiners. Furthermore, > > > >> turning > > > >> > on > > > >> > > "pig.exec.mapPartAgg" helps. (see the Pig > > > >> > > blog<https://blogs.apache.org/pig/entry/apache_pig_it_goes_to > >for > > > >> > > details. > > > >> > > ) > > > >> > > > > > >> > > Thanks, > > > >> > > Cheolsoo > > > >> > > > > > >> > > > > > >> > > On Thu, Aug 22, 2013 at 4:01 AM, Benjamin Jakobus < > > > >> > jakobusbe...@gmail.com > > > >> > > >wrote: > > > >> > > > > > >> > > > Hi Cheolsoo, > > > >> > > > > > > >> > > > Thanks - I will try this now and get back to you. > > > >> > > > > > > >> > > > Out of interest; could you explain (or point me towards > > resources > > > >> that > > > >> > > > would) why the combiner would be a problem? > > > >> > > > > > > >> > > > Also, could the fact that Pig builds an intermediary data > > > structure > > > >> (?) > > > >> > > > whilst Hive just performs a sort then the arithmetic operation > > > >> explain > > > >> > > the > > > >> > > > slowdown? > > > >> > > > > > > >> > > > (Apologies, I'm quite new to Pig/Hive - just my guesses). > > > >> > > > > > > >> > > > Regards, > > > >> > > > Benjamin > > > >> > > > > > > >> > > > > > > >> > > > On 22 August 2013 01:07, Cheolsoo Park <piaozhe...@gmail.com> > > > >> wrote: > > > >> > > > > > > >> > > > > Hi Benjamin, > > > >> > > > > > > > >> > > > > Thank you very much for sharing detailed information! > > > >> > > > > > > > >> > > > > 1) From the runtime numbers that you provided, the mappers > are > > > >> very > > > >> > > slow. > > > >> > > > > > > > >> > > > > CPU time spent (ms)5,081,610168,7405,250,350CPU time spent > > > >> > > (ms)5,052,700 > > > >> > > > > 178,2205,230,920CPU time spent (ms)5,084,430193,4805,277,910 > > > >> > > > > > > > >> > > > > 2) In your GROUP BY query, you have an algebraic UDF > "COUNT". > > > >> > > > > > > > >> > > > > I am wondering whether disabling combiner will help here. I > > have > > > >> > seen a > > > >> > > > lot > > > >> > > > > of cases where combiner actually hurt performance > > significantly > > > >> if it > > > >> > > > > doesn't combine mapper outputs significantly. Briefly > looking > > at > > > >> > > > > generate_data.pl in PIG-200, it looks like a lot of random > > keys > > > >> are > > > >> > > > > generated. So I guess you will end up with a large number of > > > small > > > >> > bags > > > >> > > > > rather than a small number of large bags. If that's the > case, > > > >> > combiner > > > >> > > > will > > > >> > > > > only add overhead to mappers. > > > >> > > > > > > > >> > > > > Can you try to include this "set pig.exec.nocombiner true;" > > and > > > >> see > > > >> > > > whether > > > >> > > > > it helps? > > > >> > > > > > > > >> > > > > Thanks, > > > >> > > > > Cheolsoo > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus < > > > >> > > > jakobusbe...@gmail.com > > > >> > > > > >wrote: > > > >> > > > > > > > >> > > > > > Hi Cheolsoo, > > > >> > > > > > > > > >> > > > > > >>What's your query like? Can you share it? Do you call > any > > > >> > algebraic > > > >> > > > UDF > > > >> > > > > > >> after group by? I am wondering whether combiner matters > > in > > > >> your > > > >> > > > test. > > > >> > > > > > I have been running 3 different types of queries. > > > >> > > > > > > > > >> > > > > > The first was performed on datasets of 6 different sizes: > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > - Dataset size 1: 30,000 records (772KB) > > > >> > > > > > - Dataset size 2: 300,000 records (6.4MB) > > > >> > > > > > - Dataset size 3: 3,000,000 records (63MB) > > > >> > > > > > - Dataset size 4: 30 million records (628MB) > > > >> > > > > > - Dataset size 5: 300 million records (6.2GB) > > > >> > > > > > - Dataset size 6: 3 billion records (62GB) > > > >> > > > > > > > > >> > > > > > The datasets scale linearly, whereby the size equates to > > 3000 > > > * > > > >> > 10n . > > > >> > > > > > A seventh dataset consisting of 1,000 records (23KB) was > > > >> produced > > > >> > to > > > >> > > > > > perform join > > > >> > > > > > operations on. Its schema is as follows: > > > >> > > > > > name - string > > > >> > > > > > marks - integer > > > >> > > > > > gpa - float > > > >> > > > > > The data was generated using the generate data.pl perl > > script > > > >> > > > available > > > >> > > > > > for > > > >> > > > > > download > > > >> > > > > > from https://issues.apache.org/jira/browse/PIG-200 to > > > produce > > > >> the > > > >> > > > > > datasets. The results are as follows: > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > * * * * * * *Set 1 * *Set 2** * > > *Set > > > >> 3** > > > >> > > > * > > > >> > > > > > *Set > > > >> > > > > > 4** * *Set 5** * *Set 6* > > > >> > > > > > *Arithmetic** * 32.82* * 36.21* * 49.49* > > * > > > >> > 83.25* > > > >> > > > > > * > > > >> > > > > > 423.63* * 3900.78 > > > >> > > > > > *Filter 10%** * 32.94* * 34.32* * 44.56* > > * > > > >> > 66.68* > > > >> > > > > > * > > > >> > > > > > 295.59* * 2640.52 > > > >> > > > > > *Filter 90%** * 33.93* * 32.55* * 37.86* > > * > > > >> > 53.22* > > > >> > > > > > * > > > >> > > > > > 197.36* * 1657.37 > > > >> > > > > > *Group** * * *49.43* * 53.34* * 69.84* > > > >> * > > > >> > > > 105.12* > > > >> > > > > > *497.61* * 4394.21 > > > >> > > > > > *Join** * * * 49.89* * 50.08* * > 78.55* > > > >> * > > > >> > > > > 150.39* > > > >> > > > > > *1045.34* *10258.19 > > > >> > > > > > *Averaged performance of arithmetic, join, group, order, > > > >> distinct > > > >> > > > select > > > >> > > > > > and filter operations on six datasets using Pig. Scripts > > were > > > >> > > > configured > > > >> > > > > as > > > >> > > > > > to use 8 reduce and 11 map tasks.* > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > * * * Set 1** * *Set 2** * > *Set > > > 3** > > > >> > > * > > > >> > > > > > *Set > > > >> > > > > > 4** * *Set 5** * *Set 6* > > > >> > > > > > *Arithmetic** * 32.84* * 37.33* * 72.55* > > > * > > > >> > > 300.08 > > > >> > > > > > 2633.72 27821.19 > > > >> > > > > > *Filter 10% * 32.36* * 53.28* * 59.22* > > * > > > >> > 209.5* > > > >> > > > > * > > > >> > > > > > 1672.3* *18222.19 > > > >> > > > > > *Filter 90% * 31.23* * 32.68* * 36.8* > > * > > > >> > 69.55* > > > >> > > > > > * > > > >> > > > > > 331.88* *3320.59 > > > >> > > > > > *Group * * * 48.27* * 47.68* * 46.87* > > > * > > > >> > > 53.66* > > > >> > > > > > *141.36* *1233.4 > > > >> > > > > > *Join * * * * *48.54* *56.86* * > 104.6* > > > >> * > > > >> > > > > 517.5* > > > >> > > > > > * 4388.34* * - > > > >> > > > > > *Distinct** * * *48.73* *53.28* * > 72.54* > > > >> * > > > >> > > > > 109.77* > > > >> > > > > > * - * * * * - > > > >> > > > > > *Averaged performance of arithmetic, join, group, distinct > > > >> select > > > >> > and > > > >> > > > > > filter operations on six datasets using Hive. Scripts were > > > >> > configured > > > >> > > > as > > > >> > > > > to > > > >> > > > > > use 8 reduce and 11 map tasks.* > > > >> > > > > > > > > >> > > > > > (If you want to see the standard deviation, let me know). > > > >> > > > > > > > > >> > > > > > So, to summarize the results: Pig outperforms Hive, with > the > > > >> > > exception > > > >> > > > of > > > >> > > > > > using *Group By*. > > > >> > > > > > > > > >> > > > > > The Pig scripts used for this benchmark are as follows: > > > >> > > > > > *Arithmetic* > > > >> > > > > > -- Generate with basic arithmetic > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t') > > as > > > >> > (name, > > > >> > > > age, > > > >> > > > > > gpa) PARALLEL $reducers; > > > >> > > > > > B = foreach A generate age * gpa + 3, age/gpa - 1.5 > PARALLEL > > > >> > > $reducers; > > > >> > > > > > store B into '$output/dataset_300000000_projection' using > > > >> > > PigStorage() > > > >> > > > > > PARALLEL $reducers; > > > >> > > > > > > > > >> > > > > > * > > > >> > > > > > * > > > >> > > > > > *Filter 10%* > > > >> > > > > > -- Filter that removes 10% of data > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t') > > as > > > >> > (name, > > > >> > > > age, > > > >> > > > > > gpa) PARALLEL $reducers; > > > >> > > > > > B = filter A by gpa < '3.6' PARALLEL $reducers; > > > >> > > > > > store B into '$output/dataset_300000000_filter_10' using > > > >> > PigStorage() > > > >> > > > > > PARALLEL $reducers; > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > *Filter 90%* > > > >> > > > > > -- Filter that removes 90% of data > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t') > > as > > > >> > (name, > > > >> > > > age, > > > >> > > > > > gpa) PARALLEL $reducers; > > > >> > > > > > B = filter A by age < '25' PARALLEL $reducers; > > > >> > > > > > store B into '$output/dataset_300000000_filter_90' using > > > >> > PigStorage() > > > >> > > > > > PARALLEL $reducers; > > > >> > > > > > > > > >> > > > > > * > > > >> > > > > > * > > > >> > > > > > *Group* > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t') > > as > > > >> > (name, > > > >> > > > age, > > > >> > > > > > gpa) PARALLEL $reducers; > > > >> > > > > > B = group A by name PARALLEL $reducers; > > > >> > > > > > C = foreach B generate flatten(group), COUNT(A.age) > PARALLEL > > > >> > > $reducers; > > > >> > > > > > store C into '$output/dataset_300000000_group' using > > > >> PigStorage() > > > >> > > > > PARALLEL > > > >> > > > > > $reducers; > > > >> > > > > > * > > > >> > > > > > * > > > >> > > > > > *Join* > > > >> > > > > > A = load '$input/dataset_300000000' using PigStorage('\t') > > as > > > >> > (name, > > > >> > > > age, > > > >> > > > > > gpa) PARALLEL $reducers; > > > >> > > > > > B = load '$input/dataset_join' using PigStorage('\t') as > > > (name, > > > >> > age, > > > >> > > > gpa) > > > >> > > > > > PARALLEL $reducers; > > > >> > > > > > C = cogroup A by name inner, B by name inner PARALLEL > > > $reducers; > > > >> > > > > > D = foreach C generate flatten(A), flatten(B) PARALLEL > > > >> $reducers; > > > >> > > > > > store D into '$output/dataset_300000000_cogroup_big' using > > > >> > > PigStorage() > > > >> > > > > > PARALLEL $reducers; > > > >> > > > > > > > > >> > > > > > Similarly, here the Hive scripts: > > > >> > > > > > *Arithmetic* > > > >> > > > > > SELECT (dataset.age * dataset.gpa + 3) AS F1, > > > >> > > (dataset.age/dataset.gpa > > > >> > > > - > > > >> > > > > > 1.5) AS F2 > > > >> > > > > > FROM dataset > > > >> > > > > > WHERE dataset.gpa > 0; > > > >> > > > > > > > > >> > > > > > *Filter 10%* > > > >> > > > > > SELECT * > > > >> > > > > > FROM dataset > > > >> > > > > > WHERE dataset.gpa < 3.6; > > > >> > > > > > > > > >> > > > > > *Filter 90%* > > > >> > > > > > SELECT * > > > >> > > > > > FROM dataset > > > >> > > > > > WHERE dataset.age < 25; > > > >> > > > > > > > > >> > > > > > *Group* > > > >> > > > > > SELECT COUNT(dataset.age) > > > >> > > > > > FROM dataset > > > >> > > > > > GROUP BY dataset.name; > > > >> > > > > > > > > >> > > > > > *Join* > > > >> > > > > > SELECT * > > > >> > > > > > FROM dataset JOIN dataset_join > > > >> > > > > > ON dataset.name = dataset_join.name; > > > >> > > > > > > > > >> > > > > > I will re-run the benchmarks to see whether it is the > reduce > > > or > > > >> map > > > >> > > > side > > > >> > > > > > that is slower and get back to you later today. > > > >> > > > > > > > > >> > > > > > The other two benchmarks were slightly different: I > > performed > > > >> > > > transitive > > > >> > > > > > self joins in which Pig outperformed Hive. However once I > > > added > > > >> a > > > >> > > Group > > > >> > > > > By, > > > >> > > > > > Hive began outperforming Pig. > > > >> > > > > > > > > >> > > > > > I also ran the TPC-H benchmarks and noticed that Hive > > > >> > (surprisingly) > > > >> > > > > > outperformed Pig. However what *seems* to cause the actual > > > >> > > performance > > > >> > > > > > difference is the heavy usage of the Group By operator in > > all > > > >> but 3 > > > >> > > > TPC-H > > > >> > > > > > test scripts. > > > >> > > > > > > > > >> > > > > > Re-running the scripts whilst omitting the the grouping of > > > data > > > >> > > > produces > > > >> > > > > > the expected results. For example, running script 3 > > > >> > > > > > (q3_shipping_priority.pig) whilst omitting the Group By > > > operator > > > >> > > > > > significantly reduces the runtime (to 1278.49 seconds real > > > time > > > >> > > runtime > > > >> > > > > or > > > >> > > > > > a total of 12,257,630ms CPU time). > > > >> > > > > > > > > >> > > > > > The fact that the Group By operator skews the TPC-H > > benchmark > > > in > > > >> > > favour > > > >> > > > > of > > > >> > > > > > Apache Hive is supported by further experiments: as noted > > > >> earlier a > > > >> > > > > > benchmark was carried out on a transitive self-join. The > > > former > > > >> > took > > > >> > > > Pig > > > >> > > > > an > > > >> > > > > > average of 45.36 seconds (real time runtime) to execute; > it > > > took > > > >> > Hive > > > >> > > > > 56.73 > > > >> > > > > > seconds. The latter took Pig 157.97 and Hive 180.19 > seconds > > > >> > (again, > > > >> > > on > > > >> > > > > > average). However adding the Group By operator to the > > scripts > > > >> > turned > > > >> > > > the > > > >> > > > > > tides: Pig is now significantly slower than Hive, > requiring > > an > > > >> > > average > > > >> > > > of > > > >> > > > > > 278.15 seconds. Hive on the other hand required only > 204.01 > > to > > > >> > > perform > > > >> > > > > the > > > >> > > > > > JOIN and GROUP operations. > > > >> > > > > > > > > >> > > > > > Real time runtime is measured using the time -p command. > > > >> > > > > > > > > >> > > > > > Best Regards, > > > >> > > > > > Benjamin > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > On 20 August 2013 19:56, Cheolsoo Park < > > piaozhe...@gmail.com> > > > >> > wrote: > > > >> > > > > > > > > >> > > > > > > Hi Benjarmin, > > > >> > > > > > > > > > >> > > > > > > Can you describe which step of group by is slow? Mapper > > side > > > >> or > > > >> > > > reducer > > > >> > > > > > > side? > > > >> > > > > > > > > > >> > > > > > > What's your query like? Can you share it? Do you call > any > > > >> > algebraic > > > >> > > > UDF > > > >> > > > > > > after group by? I am wondering whether combiner matters > in > > > >> your > > > >> > > test. > > > >> > > > > > > > > > >> > > > > > > Thanks, > > > >> > > > > > > Cheolsoo > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > On Tue, Aug 20, 2013 at 2:27 AM, Benjamin Jakobus < > > > >> > > > > > jakobusbe...@gmail.com > > > >> > > > > > > >wrote: > > > >> > > > > > > > > > >> > > > > > > > Hi all, > > > >> > > > > > > > > > > >> > > > > > > > After benchmarking Hive and Pig, I found that the > Group > > By > > > >> > > operator > > > >> > > > > in > > > >> > > > > > > Pig > > > >> > > > > > > > is drastically slower that Hive's. I was wondering > > whether > > > >> > > anybody > > > >> > > > > has > > > >> > > > > > > > experienced the same? And whether people may have any > > tips > > > >> for > > > >> > > > > > improving > > > >> > > > > > > > the performance of this operation? (Adding a DISTINCT > as > > > >> > > suggested > > > >> > > > by > > > >> > > > > > an > > > >> > > > > > > > earlier post on here doesn't help. I am currently > > > re-running > > > >> > the > > > >> > > > > > > benchmark > > > >> > > > > > > > with LZO compression enabled). > > > >> > > > > > > > > > > >> > > > > > > > Regards, > > > >> > > > > > > > Ben > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > > > > > > > > >