Hi Benjamin, Thank you very much for sharing detailed information!
1) From the runtime numbers that you provided, the mappers are very slow. CPU time spent (ms)5,081,610168,7405,250,350CPU time spent (ms)5,052,700 178,2205,230,920CPU time spent (ms)5,084,430193,4805,277,910 2) In your GROUP BY query, you have an algebraic UDF "COUNT". I am wondering whether disabling combiner will help here. I have seen a lot of cases where combiner actually hurt performance significantly if it doesn't combine mapper outputs significantly. Briefly looking at generate_data.pl in PIG-200, it looks like a lot of random keys are generated. So I guess you will end up with a large number of small bags rather than a small number of large bags. If that's the case, combiner will only add overhead to mappers. Can you try to include this "set pig.exec.nocombiner true;" and see whether it helps? Thanks, Cheolsoo On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus <[email protected]>wrote: > Hi Cheolsoo, > > >>What's your query like? Can you share it? Do you call any algebraic UDF > >> after group by? I am wondering whether combiner matters in your test. > I have been running 3 different types of queries. > > The first was performed on datasets of 6 different sizes: > > > - Dataset size 1: 30,000 records (772KB) > - Dataset size 2: 300,000 records (6.4MB) > - Dataset size 3: 3,000,000 records (63MB) > - Dataset size 4: 30 million records (628MB) > - Dataset size 5: 300 million records (6.2GB) > - Dataset size 6: 3 billion records (62GB) > > The datasets scale linearly, whereby the size equates to 3000 * 10n . > A seventh dataset consisting of 1,000 records (23KB) was produced to > perform join > operations on. Its schema is as follows: > name - string > marks - integer > gpa - float > The data was generated using the generate data.pl perl script available > for > download > from https://issues.apache.org/jira/browse/PIG-200 to produce the > datasets. The results are as follows: > > > * * * * * * *Set 1 * *Set 2** * *Set 3** * > *Set > 4** * *Set 5** * *Set 6* > *Arithmetic** * 32.82* * 36.21* * 49.49* * 83.25* > * > 423.63* * 3900.78 > *Filter 10%** * 32.94* * 34.32* * 44.56* * 66.68* > * > 295.59* * 2640.52 > *Filter 90%** * 33.93* * 32.55* * 37.86* * 53.22* > * > 197.36* * 1657.37 > *Group** * * *49.43* * 53.34* * 69.84* * 105.12* > *497.61* * 4394.21 > *Join** * * * 49.89* * 50.08* * 78.55* * 150.39* > *1045.34* *10258.19 > *Averaged performance of arithmetic, join, group, order, distinct select > and filter operations on six datasets using Pig. Scripts were configured as > to use 8 reduce and 11 map tasks.* > > > > * * * Set 1** * *Set 2** * *Set 3** * > *Set > 4** * *Set 5** * *Set 6* > *Arithmetic** * 32.84* * 37.33* * 72.55* * 300.08 > 2633.72 27821.19 > *Filter 10% * 32.36* * 53.28* * 59.22* * 209.5* * > 1672.3* *18222.19 > *Filter 90% * 31.23* * 32.68* * 36.8* * 69.55* > * > 331.88* *3320.59 > *Group * * * 48.27* * 47.68* * 46.87* * 53.66* > *141.36* *1233.4 > *Join * * * * *48.54* *56.86* * 104.6* * 517.5* > * 4388.34* * - > *Distinct** * * *48.73* *53.28* * 72.54* * 109.77* > * - * * * * - > *Averaged performance of arithmetic, join, group, distinct select and > filter operations on six datasets using Hive. Scripts were configured as to > use 8 reduce and 11 map tasks.* > > (If you want to see the standard deviation, let me know). > > So, to summarize the results: Pig outperforms Hive, with the exception of > using *Group By*. > > The Pig scripts used for this benchmark are as follows: > *Arithmetic* > -- Generate with basic arithmetic > A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, > gpa) PARALLEL $reducers; > B = foreach A generate age * gpa + 3, age/gpa - 1.5 PARALLEL $reducers; > store B into '$output/dataset_300000000_projection' using PigStorage() > PARALLEL $reducers; > > * > * > *Filter 10%* > -- Filter that removes 10% of data > A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, > gpa) PARALLEL $reducers; > B = filter A by gpa < '3.6' PARALLEL $reducers; > store B into '$output/dataset_300000000_filter_10' using PigStorage() > PARALLEL $reducers; > > > *Filter 90%* > -- Filter that removes 90% of data > A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, > gpa) PARALLEL $reducers; > B = filter A by age < '25' PARALLEL $reducers; > store B into '$output/dataset_300000000_filter_90' using PigStorage() > PARALLEL $reducers; > > * > * > *Group* > A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, > gpa) PARALLEL $reducers; > B = group A by name PARALLEL $reducers; > C = foreach B generate flatten(group), COUNT(A.age) PARALLEL $reducers; > store C into '$output/dataset_300000000_group' using PigStorage() PARALLEL > $reducers; > * > * > *Join* > A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, > gpa) PARALLEL $reducers; > B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa) > PARALLEL $reducers; > C = cogroup A by name inner, B by name inner PARALLEL $reducers; > D = foreach C generate flatten(A), flatten(B) PARALLEL $reducers; > store D into '$output/dataset_300000000_cogroup_big' using PigStorage() > PARALLEL $reducers; > > Similarly, here the Hive scripts: > *Arithmetic* > SELECT (dataset.age * dataset.gpa + 3) AS F1, (dataset.age/dataset.gpa - > 1.5) AS F2 > FROM dataset > WHERE dataset.gpa > 0; > > *Filter 10%* > SELECT * > FROM dataset > WHERE dataset.gpa < 3.6; > > *Filter 90%* > SELECT * > FROM dataset > WHERE dataset.age < 25; > > *Group* > SELECT COUNT(dataset.age) > FROM dataset > GROUP BY dataset.name; > > *Join* > SELECT * > FROM dataset JOIN dataset_join > ON dataset.name = dataset_join.name; > > I will re-run the benchmarks to see whether it is the reduce or map side > that is slower and get back to you later today. > > The other two benchmarks were slightly different: I performed transitive > self joins in which Pig outperformed Hive. However once I added a Group By, > Hive began outperforming Pig. > > I also ran the TPC-H benchmarks and noticed that Hive (surprisingly) > outperformed Pig. However what *seems* to cause the actual performance > difference is the heavy usage of the Group By operator in all but 3 TPC-H > test scripts. > > Re-running the scripts whilst omitting the the grouping of data produces > the expected results. For example, running script 3 > (q3_shipping_priority.pig) whilst omitting the Group By operator > significantly reduces the runtime (to 1278.49 seconds real time runtime or > a total of 12,257,630ms CPU time). > > The fact that the Group By operator skews the TPC-H benchmark in favour of > Apache Hive is supported by further experiments: as noted earlier a > benchmark was carried out on a transitive self-join. The former took Pig an > average of 45.36 seconds (real time runtime) to execute; it took Hive 56.73 > seconds. The latter took Pig 157.97 and Hive 180.19 seconds (again, on > average). However adding the Group By operator to the scripts turned the > tides: Pig is now significantly slower than Hive, requiring an average of > 278.15 seconds. Hive on the other hand required only 204.01 to perform the > JOIN and GROUP operations. > > Real time runtime is measured using the time -p command. > > Best Regards, > Benjamin > > > > On 20 August 2013 19:56, Cheolsoo Park <[email protected]> wrote: > > > Hi Benjarmin, > > > > Can you describe which step of group by is slow? Mapper side or reducer > > side? > > > > What's your query like? Can you share it? Do you call any algebraic UDF > > after group by? I am wondering whether combiner matters in your test. > > > > Thanks, > > Cheolsoo > > > > > > > > > > On Tue, Aug 20, 2013 at 2:27 AM, Benjamin Jakobus < > [email protected] > > >wrote: > > > > > Hi all, > > > > > > After benchmarking Hive and Pig, I found that the Group By operator in > > Pig > > > is drastically slower that Hive's. I was wondering whether anybody has > > > experienced the same? And whether people may have any tips for > improving > > > the performance of this operation? (Adding a DISTINCT as suggested by > an > > > earlier post on here doesn't help. I am currently re-running the > > benchmark > > > with LZO compression enabled). > > > > > > Regards, > > > Ben > > > > > >
