[
https://issues.apache.org/jira/browse/PIG-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704815#action_12704815
]
Dick King commented on PIG-791:
-------------------------------
I am considering a modification to Hadoop that would allow users to declare
that a map/reduce output is (a) sorted, (b) likely to be the input to some
other map/reduce job in which the second mapper re-emits selected keys
unchanged, with a selection probability not correlated with the ordering, and
(c) sorted in the same order that the second map/reduce job uses.
It would work by writing a sample file as a secondary output of the mapper in
the first map/reduce.
This proposal is on my back burner, but could potentially be moved up.
Would that functionality be generally useful here?
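
As a rough illustration of the sample-file idea above, the sketch below passes
records through unchanged while keeping a fixed-size uniform sample of their
keys (reservoir sampling). It is plain Python, not Hadoop code; the function
name map_with_sample and its parameters are hypothetical, and the choice of
reservoir sampling is an assumption, since the comment does not say how the
sample would be drawn.

```python
import random


def map_with_sample(records, sample_size, seed=0):
    """Pass (key, value) records through unchanged and collect a key sample.

    Hypothetical stand-in for a mapper with a secondary "sample file" output.
    Returns (primary_output, sample).
    """
    rng = random.Random(seed)
    primary_output = []
    sample = []
    for i, (key, value) in enumerate(records):
        # Primary output: the record is emitted as-is.
        primary_output.append((key, value))
        # Secondary output: reservoir sampling keeps each key seen so far
        # with equal probability, using O(sample_size) memory.
        if len(sample) < sample_size:
            sample.append(key)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                sample[j] = key
    return primary_output, sample
```

In a real job the sample would be written to a side file rather than returned,
so that a later sort stage could read it without rescanning the data.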
> reducing number of MR stages with ORDER BY
> ------------------------------------------
>
> Key: PIG-791
> URL: https://issues.apache.org/jira/browse/PIG-791
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
>
> When an order by is not the only operation in a pig script, it is done in two
> additional MR jobs. The first job samples using a sampling loader, the second
> does the sort. The sample is used to construct a partitioner that equally
> balances the data in the sort. The sampler needs to be changed to be an
> EvalFunc instead of a loader. This way a split can be put in the preceding
> MR job, with the main data being written out and the other part flowing to
> the sampler func, which can then write out the sample. The final MR job can
> then be the sort.
> This change depends on multiquery code.
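
To make the role of the sample concrete, here is a minimal sketch of how a
sample of keys can be turned into a range partitioner that spreads the sort
evenly across reducers. It is plain Python and not Pig's actual partitioner;
build_cut_points and partition are hypothetical names, and picking evenly
spaced quantiles of the sorted sample is an assumed strategy.

```python
import bisect


def build_cut_points(sample, num_reducers):
    """Pick num_reducers - 1 boundary keys at evenly spaced sample quantiles."""
    ordered = sorted(sample)
    if not ordered or num_reducers <= 1:
        return []
    n = len(ordered)
    return [ordered[(i * n) // num_reducers] for i in range(1, num_reducers)]


def partition(key, cut_points):
    """Return the reducer index for key.

    Keys below the first cut point go to reducer 0, and so on, so that
    concatenating the sorted reducer outputs in index order is globally sorted.
    """
    return bisect.bisect_right(cut_points, key)


if __name__ == "__main__":
    cuts = build_cut_points(range(100), 4)
    print(cuts)
    print([partition(k, cuts) for k in (0, 24, 25, 99)])
```

Because the cut points come from the sample rather than from the full data,
the balance is only as good as the sample is representative.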