[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776760#action_12776760
 ] 

Ying He commented on PIG-979:
-----------------------------

Performance tests show no noticeable difference between trunk and the 
accumulator patch when calling UDFs that do not implement the accumulator 
interface.

The script used to test performance is:

register /homes/yinghe/pig_test/pigperf.jar;
register /homes/yinghe/pig_test/string.jar;
register /homes/yinghe/pig_test/piggybank.jar;

A = load '/user/pig/tests/data/pigmix_large/page_views'
    using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);

B = foreach A generate user,
    org.apache.pig.piggybank.evaluation.string.STRINGCAT(user, ip_addr) as id;

C = group B by id parallel 10;

D = foreach C {
    generate group, string.BagCount2(B)*string.ColumnLen2(B, 0);
}

store D into 'test2';

The input data has 100M rows and the output has 57M rows, so each UDF is 
called 57M times.
The results are:

 with patch:  5min 14sec
 w/o patch:   5min 17sec

The 3-second difference is about 1% of the total run time.
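
For readers unfamiliar with the distinction being measured, the following is a minimal Java sketch of the two calling styles: a UDF that receives the whole bag in one call, and one that receives it in batches. It is illustrative only. The names WholeBagFunc, AccumulatingFunc, CountSketch and AccumulatorDriver, the method names, and the use of List<String> in place of a bag are all assumptions made for this sketch; they are not taken from the attached patch or from the UDFs used in the test above.

```java
import java.util.List;

// Hypothetical stand-in for a non-accumulator UDF: it is handed every
// record of a group in a single call.
interface WholeBagFunc<T> {
    T exec(List<String> bag);
}

// Hypothetical stand-in for an accumulator-style UDF. The method names are
// assumptions for this sketch, not the interface defined by the patch.
interface AccumulatingFunc<T> {
    void accumulate(List<String> batch); // called once per batch of records
    T getValue();                        // called after the last batch
    void cleanup();                      // resets state before the next group
}

// A record counter that supports both calling styles.
final class CountSketch implements WholeBagFunc<Long>, AccumulatingFunc<Long> {
    private long count = 0;

    @Override
    public Long exec(List<String> bag) {
        return (long) bag.size();
    }

    @Override
    public void accumulate(List<String> batch) {
        count += batch.size();
    }

    @Override
    public Long getValue() {
        return count;
    }

    @Override
    public void cleanup() {
        count = 0;
    }
}

final class AccumulatorDriver {
    // Feeds the records of one group to the UDF in batches of at most
    // batchSize, so the full group never has to be passed in one call.
    static <T> T run(AccumulatingFunc<T> udf, List<String> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        udf.cleanup();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            udf.accumulate(records.subList(start, end));
        }
        T result = udf.getValue();
        udf.cleanup();
        return result;
    }
}
```

In this sketch, calling exec on the full list and calling AccumulatorDriver.run with any positive batch size should produce the same count; only the way the records are delivered differs. The test above exercises UDFs of the first kind, which is why no difference from the patch is expected.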

> Acummulator Interface for UDFs
> ------------------------------
>
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
>         Attachments: PIG-979.patch, PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
