[ 
https://issues.apache.org/jira/browse/PIG-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863770#comment-13863770
 ] 

Rohini Palaniswamy commented on PIG-3648:
-----------------------------------------

I don't think the sample size can be configured correctly unless we store 
statistics on the total number of records (using something like hraven) and 
use them to determine the sample size dynamically, as a proportion of that 
total. Otherwise, the best option is to let the user specify a sample size, 
since we don't know the number of records until the map completes. 
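
To make the two options above concrete, here is a minimal sketch of how a 
sample size might be chosen. It is illustration only, not code from Pig or 
from the attached patch: the class name, the method, and the 
"one sample per N records" parameter are all hypothetical, and the historical 
record count is assumed to come from an external statistics store. 

```java
/** Hypothetical helper for illustration; not part of Pig. */
final class SampleSizeSketch {
    // Mirrors the hardcoded sample size of 100 described in the issue.
    static final int DEFAULT_SAMPLE_SIZE = 100;

    /**
     * @param recordCount      record count from a previous run, or -1 if unknown (assumed input)
     * @param recordsPerSample take one sample per this many records (hypothetical knob)
     * @param userSampleSize   user-specified sample size, or -1 if unset
     */
    static int chooseSampleSize(long recordCount, long recordsPerSample, int userSampleSize) {
        if (recordsPerSample <= 0) {
            throw new IllegalArgumentException("recordsPerSample must be positive");
        }
        if (recordCount > 0) {
            // Proportional sizing: ceiling division, so small inputs still get one sample.
            long proportional = (recordCount + recordsPerSample - 1) / recordsPerSample;
            return (int) Math.min(proportional, Integer.MAX_VALUE);
        }
        if (userSampleSize > 0) {
            // No statistics available: fall back to what the user asked for.
            return userSampleSize;
        }
        return DEFAULT_SAMPLE_SIZE;
    }
}
```

With statistics available, chooseSampleSize(10_000_000L, 100_000L, -1) yields 
100; without them, the user-specified value (or the default) is used. 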

On a different note, while checking the code to confirm that the samples 
contain only the order-by columns, I saw that MR does RandomSampleLoader -> 
Foreach (to project the sort columns) because the sampler there is a loader. 
In Tez, [~daijy] fixed it to do POForeach -> POReservoirSample, projecting 
the columns early. 
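
For readers unfamiliar with the project-then-sample ordering, the following 
is a rough sketch of reservoir sampling (the classic "Algorithm R") applied 
to records that have already been projected down to their sort key. It is 
not Pig's POReservoirSample implementation; the class, the method, and the 
use of java.util.function.Function for the projection are assumptions made 
for illustration. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

/** Hypothetical illustration; not Pig's POReservoirSample. */
final class ReservoirSketch {
    /** Returns up to sampleSize sort keys drawn uniformly at random from records. */
    static <R, K> List<K> sample(Iterable<R> records, Function<R, K> projectSortKey,
                                 int sampleSize, Random rng) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be non-negative");
        }
        List<K> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        for (R record : records) {
            // Project early: only the sort key is ever held in the reservoir.
            K key = projectSortKey.apply(record);
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill phase: keep the first sampleSize keys.
                reservoir.add(key);
            } else {
                // Replace phase: pick j uniformly in [0, seen); keep the key if j lands in the reservoir.
                long j = (long) (rng.nextDouble() * seen);
                if (j < sampleSize) {
                    reservoir.set((int) j, key);
                }
            }
        }
        return reservoir;
    }
}
```

Because the reservoir holds only projected keys, its memory footprint depends 
on the sort columns rather than on the width of the full record. 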

> Make the sample size for RandomSampleLoader configurable
> --------------------------------------------------------
>
>                 Key: PIG-3648
>                 URL: https://issues.apache.org/jira/browse/PIG-3648
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>            Priority: Minor
>             Fix For: 0.13.0
>
>         Attachments: PIG-3648-1.patch
>
>
> Pig uses RandomSampleLoader for range partitioning in order-by. But since the 
> sample size is hardcoded as 100, the variance of the results increases when 
> sorting a large number of rows (e.g. 10M+ per task).
> It would be nice if the sample size could be configurable via Pig properties.



