[
https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557457#action_12557457
]
Alan Gates commented on PIG-30:
-------------------------------
Here are some performance numbers for the code before and after these
changes. I tested default bags (that is, no sorting, no distinct), distinct
bags, and sorted bags. Each test was run on the code pre- and post-patch, on
data with 100k rows, 1m rows, and 5m rows.
Default:
pig script:
a = load './studenttab5m';
b = group a all;
c = foreach b generate group, COUNT(a.$0);
dump c;
Results:
pre-patch 100k rows: 13.539
post 100k: 15.489
pre 1m: 43.002
post 1m: 48.191
pre 5m: 111.158
post 5m: 117.112
Notes: I'm assuming the slight slowdown here is due to the introduction of
locking in add() and next() in the data bags.
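To make the note above concrete, here is a minimal sketch of what locking on add() and next() could look like. It is only an illustration under assumptions: LockedBagSketch and countAll are hypothetical names, and this is not the code from bagrewrite.patch. The point is that both the add path and the iteration path acquire a lock once per tuple, which is the kind of fixed per-row cost that could account for a small slowdown on a plain COUNT.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration; not the actual bag class from the patch.
final class LockedBagSketch<T> {
    private final List<T> contents = new ArrayList<>();

    // Every add acquires the bag's monitor before touching the list.
    public synchronized void add(T tuple) {
        contents.add(tuple);
    }

    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                synchronized (LockedBagSketch.this) {
                    return pos < contents.size();
                }
            }

            // Every next() also acquires the same monitor, once per tuple.
            @Override
            public T next() {
                synchronized (LockedBagSketch.this) {
                    if (pos >= contents.size()) {
                        throw new NoSuchElementException();
                    }
                    return contents.get(pos++);
                }
            }
        };
    }

    // Adds n values, then counts them back through the locked iterator.
    static long countAll(int n) {
        LockedBagSketch<Integer> bag = new LockedBagSketch<>();
        for (int i = 0; i < n; i++) {
            bag.add(i);
        }
        long count = 0;
        Iterator<Integer> it = bag.iterator();
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }
}
```

With a single thread the locks are uncontended, so the cost per call is small, but it is paid on every row.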
Distinct:
pig script:
a = load './studenttab10m';
b = group a all;
c = foreach b { c1 = distinct $1; generate group, COUNT(c1); }
dump c;
pre-patch 100k rows: 14.927
post 100k: 14.134
pre 1m: 83.190
post 1m: 52.320
pre 5m: 744.834
post 5m: 216.043
Notes: Data had about 90% distinct values, so 100k had about 90k distinct
rows, etc.
Sorted:
pig script:
a = load './studenttab5m';
b = group a all;
c = foreach b { c1 = order $1 by $0; generate group, COUNT(c1); }
dump c;
pre-patch 100k rows: 16.964
post 100k: 12.895
pre 1m: 51.351
post 1m: 51.598
pre 5m: 236.669
post 5m: 225.688
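For easier comparison, the numbers above can be reduced to pre/post ratios. The sketch below only does that arithmetic on the figures already listed; the class and method names (PatchTimings, speedup) are made up for this example. A ratio above 1 means the post-patch code was faster.

```java
import java.util.Locale;

// Hypothetical helper; it only restates the timings reported in this comment.
final class PatchTimings {
    static final String[] TESTS = {"default", "distinct", "sorted"};
    static final String[] SIZES = {"100k", "1m", "5m"};

    // TIMES[test][size] = {pre-patch, post-patch}, copied from the results above.
    static final double[][][] TIMES = {
        {{13.539, 15.489}, {43.002, 48.191}, {111.158, 117.112}},
        {{14.927, 14.134}, {83.190, 52.320}, {744.834, 216.043}},
        {{16.964, 12.895}, {51.351, 51.598}, {236.669, 225.688}},
    };

    // pre/post ratio: above 1 means the patch made this case faster.
    static double speedup(int test, int size) {
        return TIMES[test][size][0] / TIMES[test][size][1];
    }

    public static void main(String[] args) {
        for (int t = 0; t < TESTS.length; t++) {
            for (int s = 0; s < SIZES.length; s++) {
                System.out.printf(Locale.ROOT, "%-8s %-4s pre/post = %.2f%n",
                        TESTS[t], SIZES[s], speedup(t, s));
            }
        }
    }
}
```

Read this way, the default case is slower post-patch at every size (roughly 14% at 100k rows, about 5% at 5m rows), the distinct case at 5m rows is roughly 3.4 times faster, and the sorted case at 1m rows is essentially unchanged.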
> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
> Key: PIG-30
> URL: https://issues.apache.org/jira/browse/PIG-30
> Project: Pig
> Issue Type: Bug
> Components: data
> Reporter: Benjamin Reed
> Assignee: Alan Gates
> Attachments: addhashcode.patch, bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use
> BigDataBag. I think we already do this. The problem is that the logic in
> BigDataBag is hard to follow and it is made more complicated because it
> subclasses DataBag. We should merge these two classes together.