[ https://issues.apache.org/jira/browse/PIG-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870508#action_12870508 ]
Jeff Zhang commented on PIG-1426:
---------------------------------
I ran a simple experiment to compare performance.
This is the Pig script I used:
{code}
a = load '/input';
b = foreach a generate $0,$1;
c = group b by $0 PARALLEL 2;
result = foreach c generate group,SUM(b.$1);
dump result;
{code}
The results are as follows:
|| ||Using Int||Using VInt||
|Mapper Output|3,288,892,896|2,688,892,896|
|Time cost for the pig script|12mins, 23sec|12mins, 1sec|
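Mapper output shrinks by 600,000,000 (about 18%), and the script runs 22
seconds faster (about 3%). I haven't done a complete comparison with PigMix
yet, but I believe the change will improve performance.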
> Change the size of Tuple from Int to VInt when Serialize Tuple
> --------------------------------------------------------------
>
> Key: PIG-1426
> URL: https://issues.apache.org/jira/browse/PIG-1426
> Project: Pig
> Issue Type: Improvement
> Components: data
> Affects Versions: 0.8.0
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Fix For: 0.8.0
>
> Attachments: PIG_1426.patch
>
>
> Most of the time the tuple size is not very large, so one byte is enough to
> store it. I therefore suggest using VInt instead of Int for the tuple size
> during serialization. Because the key type of the map output is Tuple, this
> can reduce the amount of data transferred from mapper to reducer.
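To illustrate why a variable-length size field saves space, here is a minimal sketch that writes the same tuple size once as a fixed 4-byte int and once as a variable-length int. It is not the code from PIG_1426.patch: the class VIntSketch and its methods are hypothetical, and the 7-bits-per-byte encoding shown is a simplified scheme that is assumed here for illustration and may differ from the VInt format Pig or Hadoop actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration only; not taken from the Pig code base.
public class VIntSketch {

    // Writes a non-negative int using 7 payload bits per byte, low-order group first.
    // The high bit of each byte signals that another byte follows.
    public static void writeVarInt(DataOutputStream out, int value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    // Reads back a value written by writeVarInt.
    public static int readVarInt(DataInputStream in) throws IOException {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Serializes the size with the variable-length scheme above.
    public static byte[] encodeVar(int size) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeVarInt(out, size);
        }
        return buf.toByteArray();
    }

    // Serializes the size as a fixed-width int (always 4 bytes).
    public static byte[] encodeFixed(int size) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(size);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (int size : new int[] {2, 127, 128, 100000}) {
            byte[] var = encodeVar(size);
            int decoded = readVarInt(new DataInputStream(new ByteArrayInputStream(var)));
            System.out.println("size=" + size
                    + " fixed=" + encodeFixed(size).length + " bytes"
                    + " variable=" + var.length + " bytes"
                    + " decoded=" + decoded);
        }
    }
}
```

In this sketch a tuple with two fields, like the one in the test script, needs one byte for its size instead of four. Larger sizes cost more bytes, so the saving depends on tuples being small, as the description assumes.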