Re: asking for comments on benchmark queries

Alan Gates Tue, 23 Jun 2009 10:32:30 -0700

Zheng,

I don't think you're subscribed to pig-dev (your emails have been bouncing to the moderator). So I've cc'd you explicitly on this.

I don't think we need a Pig JIRA, it's probably easier if we all work on the hive one. I'll post my comments on the various scripts to that bug. I've also attached them here since pig-dev won't see the updates to that bug.


Alan.

grep_select.pig:

Adding types in the LOAD statement will force Pig to cast the key field, even though it doesn't need to (it only reads and writes the key field). So I'd change the query to be:


rmf output/PIG_bench/grep_select;
a = load '/data/grep/*' using PigStorage as (key,field);
b = filter a by field matches '.*XYZ.*';
store b into 'output/PIG_bench/grep_select';

field will still be cast to a chararray for the matches, but we won't waste time casting key and then turning it back into bytes for the store.


rankings_select.pig:

Same comment, remove the casts. pagerank will be properly cast to an integer.


rmf output/PIG_bench/rankings_select;

a = load '/data/rankings/*' using PigStorage('|') as (pagerank,pageurl,aveduration);

b = filter a by pagerank > 10;
store b into 'output/PIG_bench/rankings_select';

rankings_uservisits_join.pig:

Here you want to keep the casts of pagerank so that it is handled as the right type. adRevenue will default to double in SUM when you don't specify a type. You also want to project out all unneeded columns as soon as possible. You should set PARALLEL on the join to use the number of reducers appropriate for your cluster. Given that you have 10 machines and 5 reduce slots per machine, and speculative execution is off you probably want 50 reducers. I notice you set parallel to 60 on the group by. That will give you 10 trailing reducers. Unless you have a need for the result to be split 60 ways you should reduce that to 50 as well. (I'm assuming here when you say you have a 10 node cluster you mean 10 data nodes, not counting your name node and task tracker. The reduce formula should be 5 * number of data nodes.)
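
As a rough sketch of that arithmetic: 10 data nodes * 5 reduce slots = 50 reducers running at once, so a job asking for 60 reducers runs 50 in a first wave and leaves the remaining 10 to run afterwards on their own. Assuming a Pig release that supports the default_parallel setting (it may not be available in the version you are running), the value could also be set once per script rather than on every operator:

set default_parallel 50;  -- assumed cluster: 10 data nodes * 5 reduce slots each

A PARALLEL clause on an individual join or group still takes precedence over this setting.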

A last question is how large are the uservisits and rankings data sets? If either is < 80M or so you can use the fragment/replicate join, which is much faster than the general join. The following script assumes that isn't the case; but if it is let me know and I can show you the syntax for it.


So the end query looks like:

rmf output/PIG_bench/html_join;
a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
b = load '/data/rankings/*' using PigStorage('|') as (pagerank:int,pageurl,aveduration);

c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
c1 = foreach c generate sourceIP, destURL, adRevenue;
b1 = foreach b generate pagerank, pageurl;
d = JOIN c1 by destURL, b1 by pageurl parallel 50;
d1 = foreach d generate sourceIP, pagerank, adRevenue;
e = group d1 by sourceIP parallel 50;
f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
store f into 'output/PIG_bench/html_join';
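
If one of the inputs does turn out to be small enough, the fragment/replicate version of the join in the script above might look roughly like the line below. This is only a sketch: it assumes rankings is the small input, which hasn't been established here, and the exact quoting of the 'replicated' keyword may differ between Pig versions. The small relation has to be listed last, since that is the one held in memory on each map, and PARALLEL is dropped because the join itself runs map-side with no reduce phase.

-- assumption: b1 (rankings) is the input small enough to fit in memory
d = JOIN c1 by destURL, b1 by pageurl using 'replicated';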

uservisits_aggre.pig:

Same comments as above on projecting out as early as possible and on setting parallel appropriately for your cluster.


rmf output/PIG_bench/uservisits_aggre;
a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);

a1 = foreach a generate sourceIP, adRevenue;
b = group a1 by sourceIP parallel 50;
c = FOREACH b GENERATE group, SUM(a1.adRevenue);
store c into 'output/PIG_bench/uservisits_aggre';



On Jun 22, 2009, at 10:36 PM, Zheng Shao wrote:

Hi Pig team,
We’d like to get your feedback on a set of queries we implemented onPig.
We’ve attached the hadoop configuration and pig queries in theemail. We start the queries by issuing “pig xxx.pig”. The queriesare from SIGMOD’2009 paper. More details are athttps://issues.apache.org/jira/browse/HIVE-396 (Shall we open a JIRA on PIGfor this?)
One improvement is that we are going to change hadoop to use LZO asintermediate compression algorithm very soon. Previously we usedgzip for all performance tests including hadoop, hive and pig.
The reason that we specify the number of reducers in the query is totry to match the same number of reducer as Hive automaticallysuggested. Please let us know what is the best way to set the numberof reducers in Pig.
Are there any other improvements we can make to the Pig query andthe hadoop configuration?
Thanks,
Zheng

<hadoop-site.xml><hive-default.xml><hadoop-env.sh.txt>
