[
https://issues.apache.org/jira/browse/PIG-202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mathieu Poumeyrol updated PIG-202:
----------------------------------
Attachment: Sort.patch
As requested, an all-in-one patch (Sort.patch) that:
- calls instantiateFunc on the PO before the actual execution (fixes the using
clause in local context)
- discards the only "late" comparator instantiation I could find (made
redundant, dead code)
- corrects a marginal bias in the FindQuantiles builtin function (one of the
two extreme quantiles was bigger or smaller depending on truncation)
- fixes the quantile job.
The quantile job issue is tricky. It is not easy to show how it misbehaves with
a Pig unit test, as the result is correct... FindQuantiles is responsible for
defining a partition of the intermediate keyspace. Hadoop uses this partition
through a SortPartitioner instance to split the reduce half of the Sort job
among several reduce tasks. Now, FindQuantiles was using a StarSpec as a
comparator, whereas SortPartitioner was using the UDF comparator to perform an
Arrays.binarySearch. A binary search only works on an array sorted with the
same comparator it searches with, so it cannot work correctly in these
conditions, and this leads to heavily unbalanced reduce tasks as most of the
keys fall in the same partition.
"Proving" this point actually required counting how many items go to which
partition in SortPartitioner (some printf-like debugging). But honestly, I
think the patch just makes a lot of sense.
The fix just provides the UDF comparator to the sort used internally by the
FindQuantiles job.
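The mismatch can be reproduced outside Pig with a small standalone sketch. This
is not Pig or Hadoop code: the class, the method names and the descending
comparator standing in for a user-supplied ComparatorFunc are all hypothetical.
It only assumes that cut points are taken from a sorted sample and looked up
with Arrays.binarySearch, and it counts how many keys land in each partition.

```java
import java.util.Arrays;
import java.util.Comparator;

class PartitionSkewDemo {
    // Hypothetical stand-in for a user comparator: descending order.
    static final Comparator<Integer> UDF_ORDER = Comparator.reverseOrder();

    // Pick (parts - 1) evenly spaced cut points from a sample sorted with sampleOrder.
    static Integer[] findQuantiles(Integer[] sample, int parts, Comparator<Integer> sampleOrder) {
        Integer[] sorted = sample.clone();
        Arrays.sort(sorted, sampleOrder);
        Integer[] cuts = new Integer[parts - 1];
        for (int i = 1; i < parts; i++) {
            cuts[i - 1] = sorted[i * sorted.length / parts];
        }
        return cuts;
    }

    // Map a key to a partition index by binary search over the cut points.
    static int partition(Integer key, Integer[] cuts, Comparator<Integer> lookupOrder) {
        int idx = Arrays.binarySearch(cuts, key, lookupOrder);
        return idx >= 0 ? idx : -idx - 1; // a negative result encodes the insertion point
    }

    // The "printf-like debugging" step: count keys per partition.
    static int[] countPerPartition(Integer[] keys, Integer[] cuts, Comparator<Integer> lookupOrder) {
        int[] counts = new int[cuts.length + 1];
        for (Integer key : keys) {
            counts[partition(key, cuts, lookupOrder)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Integer[] keys = new Integer[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i;
        }
        // Cut points sorted in natural order, looked up with the descending comparator.
        Integer[] mismatched = findQuantiles(keys, 4, Comparator.naturalOrder());
        // Cut points sorted with the same comparator that is used for the lookup.
        Integer[] matched = findQuantiles(keys, 4, UDF_ORDER);

        // prints [49, 1, 0, 50]
        System.out.println(Arrays.toString(countPerPartition(keys, mismatched, UDF_ORDER)));
        // prints [26, 25, 25, 24]
        System.out.println(Arrays.toString(countPerPartition(keys, matched, UDF_ORDER)));
    }
}
```

With mismatched comparators two of the four partitions receive almost nothing,
while the final sorted output would still be correct, which is why an ordinary
unit test on the result does not catch it.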
> ComparatorFunc provided to ORDER clause is not always honoured
> --------------------------------------------------------------
>
> Key: PIG-202
> URL: https://issues.apache.org/jira/browse/PIG-202
> Project: Pig
> Issue Type: Bug
> Reporter: Mathieu Poumeyrol
> Attachments: EvalSpec.patch, InstantiateFunc.patch,
> MapreducePlanCompiler.patch, Sort.patch, TestOderBy.patch
>
>
> Specifying a comparator function is acknowledged neither by the local
> implementation nor by the quantile lookup job.
> Patch coming soon.