[ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-450:
---------------------------
Attachment: PIG-450.patch
This patch adds a combiner step to distinct that removes duplicate values, so
that less data is carried across from map to reduce. Here are the resulting
running times (all times in seconds):
||Num records||Num keys||Num reducers||1.4||2.0||2.0 with this patch||
| 200M | 60 | 1 | 2547 | 1388 | 142 |
| 200M | 16M | 50 | 384 | 227 | 231 |
The main benefit is with a small number of keys (60 keys: 1388 s down to 142 s,
roughly a tenfold speedup), and there does not appear to be a penalty with a
larger number of keys (16M keys: 227 s versus 231 s).
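
To illustrate why the combiner helps most when there are few keys, here is a
minimal sketch in plain Python. It is a toy simulation, not the code in the
patch: the function names (map_phase, combine_phase, reduce_phase,
run_distinct) and the sample splits are hypothetical, and no Hadoop or Pig API
is used. It only counts how many records would cross from map to reduce with
and without a combiner.

```python
from collections import defaultdict

# Stands in for the empty tuple that the map passes along with each key.
EMPTY = ()


def map_phase(split):
    """Emit (key, empty tuple) for every record, duplicates included."""
    return [(record, EMPTY) for record in split]


def combine_phase(map_output):
    """Keep only the first empty tuple seen for each key in one map's output."""
    first_seen = {}
    for key, value in map_output:
        first_seen.setdefault(key, value)
    return list(first_seen.items())


def reduce_phase(shuffled):
    """Emit each key once, however many values arrived for it."""
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return sorted(grouped)


def run_distinct(splits, use_combiner):
    """Return (distinct keys, number of records sent from map to reduce)."""
    shuffled = []
    for split in splits:
        output = map_phase(split)
        if use_combiner:
            output = combine_phase(output)
        shuffled.extend(output)
    return reduce_phase(shuffled), len(shuffled)


# Two hypothetical input splits with heavy duplication (few distinct keys).
splits = [["a", "b", "a", "a"], ["b", "b", "c", "a"]]

without_combiner = run_distinct(splits, use_combiner=False)
with_combiner = run_distinct(splits, use_combiner=True)

print(without_combiner)  # (['a', 'b', 'c'], 8)
print(with_combiner)     # (['a', 'b', 'c'], 5)
```

In this sketch the result is the same either way, but each map sends at most
one record per distinct key when the combiner is on. With few keys that is a
large reduction; when nearly every record has its own key, there is little
for the combiner to remove.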
> PERFORMANCE: Distinct should make use of combiner to remove duplicate values
> from keys.
> ----------------------------------------------------------------------------------------
>
> Key: PIG-450
> URL: https://issues.apache.org/jira/browse/PIG-450
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: types_branch
>
> Attachments: PIG-450.patch
>
>
> In 2.0 distinct was improved by removing values in the map and just passing
> an empty tuple along with the key. This can be further improved by adding a
> combiner step that passes along only the first empty tuple instead of all of
> them.