[ 
https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557457#action_12557457
 ] 

Alan Gates commented on PIG-30:
-------------------------------

Some performance numbers based on the code before and after these changes.  I 
tested default bags (that is, no sorting, no distinct), distinct bags, and 
sorted bags.  Each test was run on the code pre- and post-patch.  Each test was 
run on data with 100k rows, 1m rows, and 5m rows.

Default:

pig script:

a = load './studenttab5m';

b = group a all;

c = foreach b generate group, COUNT(a.$0);

dump c;

Results:

pre patch, 100k rows:  13.539

post 100k:  15.489

pre 1m:  43.002

post 1m:  48.191

pre 5m: 111.158

post 5m:  117.112

Notes:  I'm assuming the slight slowdown here is due to the introduction of 
locking into add() and next() in the data bags.
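
To illustrate the kind of per-row locking that note refers to, here is a minimal 
sketch.  It is not the code from bagrewrite.patch: the class LockedBagSketch, its 
Cursor, and the count() helper are hypothetical names, and the only assumption 
carried over from the note is that add() and next() each take a lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bag; NOT Pig's actual DataBag code.
public class LockedBagSketch {
    private final List<String> rows = new ArrayList<>();

    // Every add takes the bag's monitor, even when only one thread writes.
    public synchronized void add(String row) {
        rows.add(row);
    }

    // Cursor whose next() also locks the bag once per row read.
    public class Cursor {
        private int pos = 0;

        // Returns the next row, or null when the bag is exhausted.
        public String next() {
            synchronized (LockedBagSketch.this) {
                return pos < rows.size() ? rows.get(pos++) : null;
            }
        }
    }

    public Cursor cursor() {
        return new Cursor();
    }

    // Counts rows by walking the cursor, so the lock is taken once per row.
    public static long count(LockedBagSketch bag) {
        long n = 0;
        Cursor c = bag.cursor();
        while (c.next() != null) {
            n++;
        }
        return n;
    }
}
```

In a script like the one above, every row passes through add() once and next() 
once, so any fixed cost of taking the lock is paid twice per row.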

Distinct:

pig script:

a = load './studenttab10m';

b = group a all;

c = foreach b { c1 = distinct $1; generate group, COUNT(c1); }

dump c;

pre-patch 100k rows:  14.927

post 100k:  14.134

pre 1m:  83.190

post 1m: 52.320

pre 5m:  744.834

post 5m:  216.043

Notes:  Data had about 90% distinct values, so 100k had about 90k distinct 
rows, etc.

Sorted:

pig script:

a = load './studenttab5m';

b = group a all;

c = foreach b { c1 = order $1 by $0; generate group, COUNT(c1); }

dump c;

pre-patch 100k rows:  16.964

post 100k: 12.895

pre 1m:  51.351

post 1m:  51.598

pre 5m:  236.669

post 5m:  225.688
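
For easier comparison, the timings above can be reduced to post/pre ratios.  The 
sketch below only does that arithmetic on the numbers already listed; the class 
name PatchTimings and the ratio() helper are made up for illustration.

```java
public class PatchTimings {
    // post/pre ratio: below 1.0 means the patched code was faster.
    public static double ratio(double pre, double post) {
        return post / pre;
    }

    public static void main(String[] args) {
        String[] labels = {
            "default 100k", "default 1m", "default 5m",
            "distinct 100k", "distinct 1m", "distinct 5m",
            "sorted 100k", "sorted 1m", "sorted 5m"
        };
        // {pre, post} pairs copied from the results in this comment.
        double[][] timings = {
            {13.539, 15.489}, {43.002, 48.191}, {111.158, 117.112},
            {14.927, 14.134}, {83.190, 52.320}, {744.834, 216.043},
            {16.964, 12.895}, {51.351, 51.598}, {236.669, 225.688}
        };
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-14s post/pre = %.2f%n",
                    labels[i], ratio(timings[i][0], timings[i][1]));
        }
    }
}
```

The largest change is distinct at 5m rows, where the ratio is about 0.29; the 
default case at 100k rows comes out at about 1.14.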

> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: addhashcode.patch, bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use 
> BigDataBag. I think we already do this. The problem is that the logic in 
> BigDataBag is hard to follow and it is made more complicated because it 
> subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
