Ah, I see. Thank you for the explanation & taking the time!! Makes sense.
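In case it is useful to anyone who finds this thread later, this is the variant I will be testing: Cheolsoo's suggestion applied to the Group script quoted further down. A sketch only; $input, $output and $reducers are the same parameters as in my benchmark scripts, and the output path is made up so the run doesn't overwrite the original results.

-- Cheolsoo's suggestion: disable the combiner for this script.
set pig.exec.nocombiner true;

A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa);
B = group A by name PARALLEL $reducers;
-- COUNT is the algebraic UDF that would otherwise be pushed into the combiner.
C = foreach B generate flatten(group), COUNT(A.age);
store C into '$output/dataset_300000000_group_nocombiner' using PigStorage();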
On 22 August 2013 16:38, Alan Gates <[email protected]> wrote:

> When data comes out of a map task, Hadoop serializes it so that it can
> know its exact size as it writes it into the output buffer. To run it
> through the combiner it needs to deserialize it again, and then
> re-serialize it when it comes out. So each pass through the combiner
> costs a serialization/deserialization pass, which is expensive and not
> worth it unless the data reduction is significant.
>
> In other words, the combiner can be slow because Java lacks a sizeof
> operator.
>
> Alan.
>
> On Aug 22, 2013, at 4:01 AM, Benjamin Jakobus wrote:
>
>> Hi Cheolsoo,
>>
>> Thanks - I will try this now and get back to you.
>>
>> Out of interest: could you explain (or point me towards resources that
>> would) why the combiner would be a problem?
>>
>> Also, could the fact that Pig builds an intermediary data structure (?)
>> whilst Hive just performs a sort and then the arithmetic operation
>> explain the slowdown?
>>
>> (Apologies, I'm quite new to Pig/Hive - these are just my guesses.)
>>
>> Regards,
>> Benjamin
>>
>> On 22 August 2013 01:07, Cheolsoo Park <[email protected]> wrote:
>>
>>> Hi Benjamin,
>>>
>>> Thank you very much for sharing detailed information!
>>>
>>> 1) From the runtime numbers that you provided, the mappers are very
>>> slow:
>>>
>>> CPU time spent (ms):  map 5,081,610  reduce 168,740  total 5,250,350
>>> CPU time spent (ms):  map 5,052,700  reduce 178,220  total 5,230,920
>>> CPU time spent (ms):  map 5,084,430  reduce 193,480  total 5,277,910
>>>
>>> 2) In your GROUP BY query, you have an algebraic UDF, "COUNT".
>>>
>>> I am wondering whether disabling the combiner will help here. I have
>>> seen a lot of cases where the combiner actually hurt performance
>>> significantly because it didn't reduce the mapper output by much.
>>> Briefly looking at generate_data.pl in PIG-200, it looks like a lot of
>>> random keys are generated, so I guess you will end up with a large
>>> number of small bags rather than a small number of large bags. If
>>> that's the case, the combiner will only add overhead to the mappers.
>>>
>>> Can you try to include "set pig.exec.nocombiner true;" and see whether
>>> it helps?
>>>
>>> Thanks,
>>> Cheolsoo
>>>
>>> On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus
>>> <[email protected]> wrote:
>>>
>>>> Hi Cheolsoo,
>>>>
>>>>> What's your query like? Can you share it? Do you call any algebraic
>>>>> UDF after group by? I am wondering whether combiner matters in your
>>>>> test.
>>>>
>>>> I have been running three different types of queries.
>>>>
>>>> The first was performed on datasets of six different sizes:
>>>>
>>>> - Dataset size 1: 30,000 records (772KB)
>>>> - Dataset size 2: 300,000 records (6.4MB)
>>>> - Dataset size 3: 3,000,000 records (63MB)
>>>> - Dataset size 4: 30 million records (628MB)
>>>> - Dataset size 5: 300 million records (6.2GB)
>>>> - Dataset size 6: 3 billion records (62GB)
>>>>
>>>> The datasets scale linearly, whereby the size of dataset n equates to
>>>> 3000 * 10^n records. A seventh dataset consisting of 1,000 records
>>>> (23KB) was produced to perform join operations on. Its schema is as
>>>> follows:
>>>>
>>>> name  - string
>>>> marks - integer
>>>> gpa   - float
>>>>
>>>> The datasets were generated using the generate_data.pl Perl script
>>>> available for download from
>>>> https://issues.apache.org/jira/browse/PIG-200.
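>>>> (As an aside for anyone reproducing this: the benchmark scripts below
>>>> load every field untyped. A minimal typed load matching the schema
>>>> above would look like the sketch below; the alias J is purely
>>>> illustrative, and note that the join scripts themselves name the
>>>> second field "age" rather than "marks":
>>>>
>>>> J = load '$input/dataset_join' using PigStorage('\t')
>>>>         as (name:chararray, marks:int, gpa:float);
>>>> -- print the declared schema as a sanity check
>>>> describe J;
>>>>
>>>> This is only a sketch and is not part of the benchmark.)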
>>>> The results are as follows (averaged real-time runtimes, in seconds):
>>>>
>>>>              Set 1    Set 2    Set 3     Set 4     Set 5      Set 6
>>>> Arithmetic   32.82    36.21    49.49     83.25    423.63    3900.78
>>>> Filter 10%   32.94    34.32    44.56     66.68    295.59    2640.52
>>>> Filter 90%   33.93    32.55    37.86     53.22    197.36    1657.37
>>>> Group        49.43    53.34    69.84    105.12    497.61    4394.21
>>>> Join         49.89    50.08    78.55    150.39   1045.34   10258.19
>>>>
>>>> Averaged performance of the arithmetic, join, group and filter
>>>> operations on the six datasets using Pig. Scripts were configured to
>>>> use 8 reduce and 11 map tasks.
>>>>
>>>>              Set 1    Set 2    Set 3     Set 4     Set 5      Set 6
>>>> Arithmetic   32.84    37.33    72.55    300.08   2633.72   27821.19
>>>> Filter 10%   32.36    53.28    59.22    209.5    1672.3    18222.19
>>>> Filter 90%   31.23    32.68    36.8      69.55    331.88    3320.59
>>>> Group        48.27    47.68    46.87     53.66    141.36    1233.4
>>>> Join         48.54    56.86   104.6     517.5    4388.34       -
>>>> Distinct     48.73    53.28    72.54    109.77      -          -
>>>>
>>>> Averaged performance of the arithmetic, join, group, distinct select
>>>> and filter operations on the six datasets using Hive. Scripts were
>>>> configured to use 8 reduce and 11 map tasks.
>>>>
>>>> (If you want to see the standard deviations, let me know.)
>>>>
>>>> So, to summarize the results: Pig outperforms Hive, with the
>>>> exception of Group By.
>>>>
>>>> The Pig scripts used for this benchmark are as follows:
>>>>
>>>> Arithmetic:
>>>>
>>>> -- Generate with basic arithmetic
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = foreach A generate age * gpa + 3, age / gpa - 1.5 PARALLEL $reducers;
>>>> store B into '$output/dataset_300000000_projection' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Filter 10%:
>>>>
>>>> -- Filter that removes 10% of the data
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = filter A by gpa < '3.6' PARALLEL $reducers;
>>>> store B into '$output/dataset_300000000_filter_10' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Filter 90%:
>>>>
>>>> -- Filter that removes 90% of the data
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = filter A by age < '25' PARALLEL $reducers;
>>>> store B into '$output/dataset_300000000_filter_90' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Group:
>>>>
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = group A by name PARALLEL $reducers;
>>>> C = foreach B generate flatten(group), COUNT(A.age) PARALLEL $reducers;
>>>> store C into '$output/dataset_300000000_group' using PigStorage() PARALLEL $reducers;
>>>>
>>>> Join:
>>>>
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers;
>>>> C = cogroup A by name inner, B by name inner PARALLEL $reducers;
>>>> D = foreach C generate flatten(A), flatten(B) PARALLEL $reducers;
>>>> store D into '$output/dataset_300000000_cogroup_big' using PigStorage() PARALLEL $reducers;
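>>>> (One variant I may also try: since dataset_join holds only 1,000
>>>> records, the cogroup above could be replaced by Pig's
>>>> fragment-replicated join, which ships the small relation to every
>>>> mapper and avoids the reduce phase of the join entirely. A sketch
>>>> only, not one of the benchmark scripts, and the output path is made
>>>> up:
>>>>
>>>> A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa);
>>>> B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa);
>>>> -- 'replicated' loads the last-listed relation (B) into memory on each mapper
>>>> C = join A by name, B by name using 'replicated';
>>>> store C into '$output/dataset_300000000_repl_join' using PigStorage();
>>>> )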
>>>> Similarly, here are the Hive scripts:
>>>>
>>>> Arithmetic:
>>>>
>>>> SELECT (dataset.age * dataset.gpa + 3) AS F1,
>>>>        (dataset.age / dataset.gpa - 1.5) AS F2
>>>> FROM dataset
>>>> WHERE dataset.gpa > 0;
>>>>
>>>> Filter 10%:
>>>>
>>>> SELECT *
>>>> FROM dataset
>>>> WHERE dataset.gpa < 3.6;
>>>>
>>>> Filter 90%:
>>>>
>>>> SELECT *
>>>> FROM dataset
>>>> WHERE dataset.age < 25;
>>>>
>>>> Group:
>>>>
>>>> SELECT COUNT(dataset.age)
>>>> FROM dataset
>>>> GROUP BY dataset.name;
>>>>
>>>> Join:
>>>>
>>>> SELECT *
>>>> FROM dataset JOIN dataset_join
>>>> ON dataset.name = dataset_join.name;
>>>>
>>>> I will re-run the benchmarks to see whether it is the reduce or the
>>>> map side that is slower and get back to you later today.
>>>>
>>>> The other two benchmarks were slightly different: I performed two
>>>> transitive self-joins, in which Pig outperformed Hive. However, once
>>>> I added a Group By, Hive began outperforming Pig.
>>>>
>>>> I also ran the TPC-H benchmarks and noticed that Hive (surprisingly)
>>>> outperformed Pig. However, what *seems* to cause the actual
>>>> performance difference is the heavy usage of the Group By operator
>>>> in all but 3 of the TPC-H test scripts.
>>>>
>>>> Re-running the scripts whilst omitting the grouping of data produces
>>>> the expected results. For example, running script 3
>>>> (q3_shipping_priority.pig) whilst omitting the Group By operator
>>>> significantly reduces the runtime (to 1278.49 seconds of real-time
>>>> runtime, or a total of 12,257,630 ms of CPU time).
>>>>
>>>> The fact that the Group By operator skews the TPC-H benchmark in
>>>> favour of Apache Hive is supported by further experiments: as noted
>>>> above, two transitive self-join benchmarks were carried out. The
>>>> first took Pig an average of 45.36 seconds (real-time runtime) to
>>>> execute, whereas it took Hive 56.73 seconds; the second took Pig
>>>> 157.97 seconds and Hive 180.19 seconds (again, on average). However,
>>>> adding the Group By operator to the scripts turned the tides: Pig
>>>> became significantly slower than Hive, requiring an average of
>>>> 278.15 seconds, whereas Hive required only 204.01 seconds to perform
>>>> the JOIN and GROUP operations.
>>>>
>>>> Real-time runtimes are measured using the time -p command.
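>>>> (For clarity, a hypothetical sketch of what "adding the Group By
>>>> operator" to a self-join script looks like; this is a reconstruction,
>>>> not the exact benchmark script, and the input/output paths are made
>>>> up:
>>>>
>>>> A = load '$input/dataset' using PigStorage('\t') as (name, age, gpa);
>>>> B = load '$input/dataset' using PigStorage('\t') as (name, age, gpa);
>>>> -- self-join on name, then group the joined relation and count
>>>> J = join A by name, B by name PARALLEL $reducers;
>>>> G = group J by A::name PARALLEL $reducers;
>>>> C = foreach G generate flatten(group), COUNT(J);
>>>> store C into '$output/selfjoin_group' using PigStorage();
>>>> )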
>>>> Best Regards,
>>>> Benjamin
>>>>
>>>> On 20 August 2013 19:56, Cheolsoo Park <[email protected]> wrote:
>>>>
>>>>> Hi Benjamin,
>>>>>
>>>>> Can you describe which step of the group by is slow, the mapper side
>>>>> or the reducer side?
>>>>>
>>>>> What's your query like? Can you share it? Do you call any algebraic
>>>>> UDF after group by? I am wondering whether combiner matters in your
>>>>> test.
>>>>>
>>>>> Thanks,
>>>>> Cheolsoo
>>>>>
>>>>> On Tue, Aug 20, 2013 at 2:27 AM, Benjamin Jakobus
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> After benchmarking Hive and Pig, I found that the Group By operator
>>>>>> in Pig is drastically slower than Hive's. I was wondering whether
>>>>>> anybody has experienced the same, and whether people may have any
>>>>>> tips for improving the performance of this operation? (Adding a
>>>>>> DISTINCT, as suggested by an earlier post on here, doesn't help. I
>>>>>> am currently re-running the benchmark with LZO compression
>>>>>> enabled.)
>>>>>>
>>>>>> Regards,
>>>>>> Ben
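>>>>>>
>>>>>> P.S. By "LZO compression enabled" I mean compressing Pig's
>>>>>> intermediate (temporary) files between MapReduce jobs. If I
>>>>>> understand the settings correctly, these are the two properties I
>>>>>> will be toggling (assuming an LZO codec is installed on the
>>>>>> cluster); they can also go in pig.properties:
>>>>>>
>>>>>> -- compress Pig's temp files between consecutive MR jobs
>>>>>> set pig.tmpfilecompression true;
>>>>>> -- use LZO as the codec (gz is the other supported value)
>>>>>> set pig.tmpfilecompression.codec lzo;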
