[
https://issues.apache.org/jira/browse/PIG-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated PIG-4960:
------------------------------------
Status: Patch Available (was: Open)
> Split followed by order by/skewed join is skewed
> ------------------------------------------------
>
> Key: PIG-4960
> URL: https://issues.apache.org/jira/browse/PIG-4960
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-4960-1.patch
>
>
> Sampling is not done right. Split is a special case as EOP is returned after
> each record is processed. We did fixes for that before (PIG-4480, etc), but
> still it is not done right.
> In case of skewed join, skipInterval is applied for each record instead of
> all the records. So except for the first record all the other records are
> mostly skipped. Sampling is slightly better than worse if there is a FLATTEN
> of bag on the input record to Split as there are multiple records to process.
>
> In case of order by, samples were being returned even as they were being
> updated with new data. So samples mostly contained records from the first few
> hundreds of rows.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)