[ 
https://issues.apache.org/jira/browse/PIG-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393036#comment-15393036
 ] 

Rohini Palaniswamy commented on PIG-4960:
-----------------------------------------

int rand = randGen.nextInt(rowProcessed + 1);  - This change wasn't strictly 
necessary. I made it only to stay consistent with the RandomSampleLoader code, 
which did rowNum = numSamples+1
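
For readers following along, here is a minimal sketch of the sampling step that 
line belongs to, assuming it is the usual reservoir sampling scheme (Algorithm R). 
This is not the code from the patch: the class ReservoirSketch and the method 
sample are hypothetical names, and only randGen, rowProcessed and numSamples 
mirror identifiers mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not taken from PIG-4960-1.patch.
class ReservoirSketch {

    // Returns up to numSamples rows, each input row being equally likely to be kept.
    static <T> List<T> sample(Iterable<T> rows, int numSamples, Random randGen) {
        List<T> samples = new ArrayList<>();
        int rowProcessed = 0; // rows seen before the current one
        for (T row : rows) {
            if (samples.size() < numSamples) {
                // Fill the reservoir with the first numSamples rows.
                samples.add(row);
            } else {
                // Uniform pick in [0, rowProcessed]; the current row replaces an
                // existing sample with probability numSamples / (rowProcessed + 1).
                int rand = randGen.nextInt(rowProcessed + 1);
                if (rand < numSamples) {
                    samples.set(rand, row);
                }
            }
            rowProcessed++;
        }
        // The reservoir is only a valid sample once the whole input has been seen.
        return samples;
    }
}
```

In this sketch the samples are handed back only after the loop finishes. Returning 
the list while rows are still arriving would bias it toward the earliest rows.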

> Split followed by order by/skewed join is skewed
> ------------------------------------------------
>
>                 Key: PIG-4960
>                 URL: https://issues.apache.org/jira/browse/PIG-4960
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-4960-1.patch
>
>
> Sampling is not done correctly. Split is a special case because EOP is 
> returned after each record is processed. We made fixes for that before 
> (PIG-4480, etc.), but it is still not right.  
>    In the case of skewed join, skipInterval is applied to each record instead 
> of across all the records. So, except for the first record, the other records 
> are mostly skipped. Sampling is slightly better if there is a FLATTEN of a 
> bag on the input record to Split, as there are multiple records to process. 
>  
>   In the case of order by, samples were being returned even as they were still 
> being updated with new data. So samples mostly contained records from the 
> first few hundred rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
