Two questions relating to that:
1) We currently hardcode "parallel 40" in PigMix. Since Pig can now select parallelism automatically, would it be better to let it do so?
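To make question 1 concrete, here is a rough sketch of the two options. It assumes the parallel clause sits on the group statement, which is my guess at where PigMix puts it. The two property values are, I believe, Pig's defaults for reducer estimation; they are shown only for illustration and are not tuned PigMix settings.

```pig
register pigperf.jar;
%default PIGMIX_DIR /user/pig/tests/data/pigmix
A = load '$PIGMIX_DIR/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, timestamp;

-- Option 1 (current style, assumed placement): reducer count fixed in the script.
-- C = group B by user parallel 40;

-- Option 2: no parallel clause, so Pig estimates the reducer count from the
-- input size. These properties bound that estimate (values are assumptions).
set pig.exec.reducers.bytes.per.reducer 1000000000;
set pig.exec.reducers.max 999;
C = group B by user;
```

With option 2 the reducer count would scale with the size of the test data instead of staying at 40, which may or may not be what we want for comparable benchmark runs.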
2) I noticed that L17 can be greatly optimized. Currently it does this:

    register pigperf.jar;
    %default PIGMIX_DIR /user/pig/tests/data/pigmix
    A = load '$PIGMIX_DIR/page_views'
        using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
        as (user, action, timespent, query_term, ip_addr, timestamp,
            estimated_revenue, page_info, page_links);
    B = foreach A generate user, timestamp;
    C = group B by user;
    D = foreach C {
        morning = filter B by timestamp < 43200;
        afternoon = filter B by timestamp >= 43200;
        generate group, COUNT(morning), COUNT(afternoon);
    }
    store D into 'L7out';

It can be rewritten so that the combiner is used, by computing the morning/afternoon flags before the group and summing them afterwards:

    register pigperf.jar;
    %default PIGMIX_DIR /user/pig/tests/data/pigmix
    A = load '$PIGMIX_DIR/page_views'
        using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
        as (user, action, timespent, query_term, ip_addr, timestamp,
            estimated_revenue, page_info, page_links);
    B = foreach A generate user, timestamp,
        (timestamp < 43200 ? 1 : 0) as morning,
        (timestamp >= 43200 ? 1 : 0) as afternoon;
    C = group B by user;
    D = foreach C {
        generate group, SUM(B.morning), SUM(B.afternoon);
    }
    store D into 'L7out';

Is L17 supposed to test something that precludes the use of combiners, or is improving the query fair game?

D