[
https://issues.apache.org/jira/browse/PIG-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863770#comment-13863770
]
Rohini Palaniswamy commented on PIG-3648:
-----------------------------------------
I don't think it is possible to configure it correctly unless we store
statistics on the total number of records (using something like hRaven) and use
them to determine the sample size dynamically, as a proportion of that total.
Otherwise, the best option is to let the user specify a sample size, since we
don't know the number of records until the map completes.
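
The two options above could be combined roughly as follows. This is only a sketch, not code from the patch: the class name, the method name, and the recordsPerSample parameter (the inverse of the sampling proportion) are all made up for illustration, and the only value taken from this issue is the current hardcoded size of 100.

```java
// Hypothetical helper, not part of Pig: picks a sample size for the order-by sampler.
final class SampleSizeChooser {
    // The sample size that is currently hardcoded, per the issue description.
    static final int DEFAULT_SAMPLE_SIZE = 100;

    /**
     * @param knownRecordCount record count from stored statistics, or <= 0 if unknown
     * @param recordsPerSample take one sample per this many records (assumed knob)
     * @param userSampleSize   sample size set by the user, or <= 0 if not set
     */
    static int chooseSampleSize(long knownRecordCount, long recordsPerSample, int userSampleSize) {
        if (knownRecordCount > 0 && recordsPerSample > 0) {
            // Ceiling division in integer arithmetic, so no floating-point rounding.
            long scaled = knownRecordCount / recordsPerSample
                    + (knownRecordCount % recordsPerSample != 0 ? 1 : 0);
            // Never go below the current default, and stay within int range.
            long bounded = Math.max(DEFAULT_SAMPLE_SIZE, Math.min(scaled, Integer.MAX_VALUE));
            return (int) bounded;
        }
        // No statistics available: fall back to the user's value, then to the default.
        return userSampleSize > 0 ? userSampleSize : DEFAULT_SAMPLE_SIZE;
    }
}
```

For example, with 10M records and one sample per 10,000 records this yields 1000 samples, while a task with no stored statistics simply uses whatever the user configured.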
On a different note, while checking the code to confirm that the samples
contain only the order-by columns, I saw that MR does RandomSampleLoader ->
Foreach (to project the sort columns) because it is a loader. In Tez, [~daijy]
fixed it to do POForeach -> POReservoirSample, projecting the columns early.
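
To illustrate the project-then-sample ordering, here is a minimal standalone sketch of reservoir sampling (the classic Algorithm R) that projects the sort key before a record can enter the reservoir. It is not the POReservoirSample implementation; the class, the method, and the projectSortKey function are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical sketch, not Pig code: project first, then reservoir-sample.
final class ReservoirSketch {
    static <T, K> List<K> sample(Iterable<T> records,
                                 Function<T, K> projectSortKey,
                                 int sampleSize,
                                 Random rng) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be >= 0");
        }
        List<K> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        for (T record : records) {
            // Project early so only the sort columns are ever held in memory.
            K key = projectSortKey.apply(record);
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill phase: the first sampleSize keys are kept as-is.
                reservoir.add(key);
            } else {
                // Pick a slot in [0, seen); the key survives only if the slot
                // falls inside the reservoir, i.e. with probability sampleSize/seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, key);
                }
            }
        }
        return reservoir;
    }
}
```

Because the reservoir size is a plain parameter here, the same sketch works whether the size is fixed at 100 or supplied by the user.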
> Make the sample size for RandomSampleLoader configurable
> --------------------------------------------------------
>
> Key: PIG-3648
> URL: https://issues.apache.org/jira/browse/PIG-3648
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Priority: Minor
> Fix For: 0.13.0
>
> Attachments: PIG-3648-1.patch
>
>
> Pig uses RandomSampleLoader for range partitioning in order-by. But since the
> sample size is hardcoded to 100, the variance of the results increases when
> sorting a large number of rows (e.g. 10M+ per task).
> It would be nice if the sample size were configurable via Pig properties.