[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260155#comment-13260155
 ] 

Jie Li commented on PIG-200:
----------------------------

Did anyone notice that pig-0.9.0 uses one more job for L9 than pig-0.8.1?

L9 is very simple (slightly changed here to set default_parallel):
{code}
SET default_parallel $factor

register ../pigperf.jar;
A = load '$input/pigmix_page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = order A by query_term;
store B into '$output/L9out';
{code}


{panel:title=Output information of 0.8.1}
JobId                  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature   Outputs
job_201204222028_0192  60    1        33          12          20          102            102            102            B      SAMPLER
job_201204222028_0193  60    17       78          39          57          533            147            360            B      ORDER_BY  /tmp/10m-0.8.1/L9out,
{panel}

{panel:title=Output information of 0.9.0}
JobId                  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature   Outputs
job_201204222028_0269  60    0        171         27          116         0              0              0              A      MAP_ONLY
job_201204222028_0270  60    1        63          9           26          136            136            136            B      SAMPLER
job_201204222028_0271  60    17       183         30          66          657            262            446            B      ORDER_BY  /tmp/10m-0.9.0/L9out,
{panel}
We can see that 0.9.0 adds a MAP_ONLY job (alias A) to load the data before
the SAMPLER job runs, and that this job is almost as expensive as the ORDER_BY
job. In my environment, with 4 slave nodes processing 10m records, the total
running time increases from 1021 seconds (0.8.1) to 1921 seconds (0.9.0)!
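
For reference, the size of the regression implied by these figures can be worked out with a small Python sketch. The totals and the MaxMapTime values are copied from this comment and from the 0.9.0 panel above. The function and variable names are only illustrative, and the assumption that the MaxMapTime column is in seconds is mine.

```python
# Total running times quoted in this comment, in seconds.
total_081 = 1021
total_090 = 1921


def slowdown(old, new):
    """Return (extra seconds, ratio new/old) for two running times."""
    return new - old, new / old


extra, ratio = slowdown(total_081, total_090)
print(f"0.9.0 takes {extra} s longer, {ratio:.2f}x the 0.8.1 time")

# MaxMapTime per job, copied from the 0.9.0 panel.
# Assumption: the column is reported in seconds.
max_map_time_090 = {"MAP_ONLY": 171, "SAMPLER": 63, "ORDER_BY": 183}
share = max_map_time_090["MAP_ONLY"] / max_map_time_090["ORDER_BY"]
print(f"MAP_ONLY MaxMapTime is {share:.0%} of ORDER_BY MaxMapTime")
```

This only restates the numbers already reported. It does not explain why 0.9.0 schedules the extra job.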

Does anybody know what happened?
                
> Pig Performance Benchmarks
> --------------------------
>
>                 Key: PIG-200
>                 URL: https://issues.apache.org/jira/browse/PIG-200
>             Project: Pig
>          Issue Type: Task
>            Reporter: Amir Youssefi
>            Assignee: Alan Gates
>             Fix For: 0.2.0
>
>         Attachments: generate_data.pl, perf-0.6.patch, perf.hadoop.patch, 
> perf.patch, pigmix2.patch, pigmix_pig0.11.patch
>
>
> To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
> plus Script Collection. This is used in comparison of different Pig releases, 
> Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
> Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data-sets in the order 
> of tens of TBs. Next step is hundreds of TBs.
> We need to have an open large-data set (open source scripts which generate 
> data-set) and detailed scripts for important operations such as ORDER, 
> AGGREGATION etc.
> We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
> running scripts) and Triathlon (Mix). 
> I will update this JIRA with more details of current activities soon.


        
