[
https://issues.apache.org/jira/browse/PIG-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611759#action_12611759
]
Shravan Matthur Narayanamurthy commented on PIG-285:
----------------------------------------------------
I have found the issue. Before describing it, let me give some background so
that fixing other related issues is simpler:
ORDER BY is handled with quantiles. So a job that has an ORDER BY occurring in
the main plan is run as multiple jobs:
1. Store the output up to the ORDER BY.
2. Run a quantile job to find the quantiles.
3. Run the sort job.
The following should be the Pig script equivalent of getQuantileJob in
MRCompiler:
{noformat}
A = load fSpec using RandomSampleLoader;
B = foreach A generate flatten(col1), flatten(col2), ...;
C = group B all;
D = foreach C {
    D1 = order $1 by *;
    generate requestedParallelism, D1;
}
E = foreach D generate FindQuantiles(*);
store E into quantFiles;
{noformat}
The getSortJob should look something like this:
{noformat}
A = load fSpec using BinStorage;
B = group A by (col1,col2,...);
C = foreach B generate flatten(A);
{noformat}
C should then hold the output of the ORDER BY.
Also, the sort job needs a few key settings turned on in Hadoop for it to
work:
1. Use SortPartitioner as the key partitioner, which internally uses the
quantile file generated by the quantile job.
2. Supply any user-defined comparator to Hadoop as the output key comparator.
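To illustrate how the two settings above interact, here is a minimal Java
sketch of a quantile-based partitioner. It is not Pig's SortPartitioner: the
class name QuantilePartitionSketch and its constructor are hypothetical, and
the boundaries are passed in directly instead of being read from the quantile
file. The point it shows is that the boundaries and the lookup must use the
same ordering as the key comparator.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in, NOT Pig's SortPartitioner: maps a key to a reducer
// index by searching sorted quantile boundaries.
final class QuantilePartitionSketch<K> {
    // Boundaries sorted according to `order`; there is one fewer than reducers.
    private final K[] boundaries;
    private final Comparator<? super K> order;

    QuantilePartitionSketch(K[] sortedBoundaries, Comparator<? super K> order) {
        this.boundaries = sortedBoundaries.clone();
        this.order = order;
    }

    // Returns a partition index in the range [0, boundaries.length].
    int getPartition(K key) {
        int pos = Arrays.binarySearch(boundaries, key, order);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

With boundaries {"g", "p"} and natural ordering, "alice" goes to partition 0,
"mike" to partition 1 and "zach" to partition 2. With a descending comparator
the boundaries would be {"p", "g"} and "zach" would land in partition 0.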
That is the ideal behaviour. The actual issue was the following:
Because the physical plan of the quantile job was hand-crafted, it implemented
the following plan instead of the intended one:
{noformat}
A = load fSpec using RandomSampleLoader;
B = foreach A generate flatten(col1), flatten(col2), ...;
C = group B all;
D = foreach C {
    generate requestedParallelism, $1;
}
E = foreach D generate FindQuantiles(*);
store E into quantFiles;
{noformat}
Hence, instead of the sorted output, all we saw was the grouped output, and
probably some incorrect results, because the quantiles could be wrong when a
PARALLEL clause was used along with the ORDER BY. This is the cause of
[PIG-292|https://issues.apache.org/jira/browse/PIG-292].
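A small Java sketch of why the missing nested order by matters. This is not
the FindQuantiles implementation: the class QuantileSketch and the
evenly-spaced selection rule are assumptions made only for illustration.
Picking boundaries at fixed positions gives usable cut points only if the
sample is sorted first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, NOT Pig's FindQuantiles.
final class QuantileSketch {
    // Picks (parallelism - 1) boundaries at evenly spaced positions.
    // The result is only meaningful if `sample` is already sorted.
    static List<Integer> pickBoundaries(List<Integer> sample, int parallelism) {
        List<Integer> boundaries = new ArrayList<>();
        for (int i = 1; i < parallelism; i++) {
            boundaries.add(sample.get(i * sample.size() / parallelism));
        }
        return boundaries;
    }

    // Plays the role of the nested "order ... by *" step in the quantile job.
    static List<Integer> sortedCopy(List<Integer> sample) {
        List<Integer> copy = new ArrayList<>(sample);
        Collections.sort(copy);
        return copy;
    }
}
```

For the sample [50, 10, 40, 20, 30, 60] and a parallelism of 3, the sorted
sample yields the boundaries [30, 50], while the unsorted sample yields
[40, 30], which are not even in ascending order.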
The other part was that the user-defined comparator was not being passed as
the output key comparator, which is the cause of the current bug. Another
thing that kept us from finding the bug early was an error in the testSort
test case, which I have corrected in
[PIG-295|https://issues.apache.org/jira/browse/PIG-295].
To resolve this issue, I first corrected the quantile job to include the order
by in the nested plan. However, this caused issues with deserializing
POUserComparisonFunc, which extended POUserFunc: when POUserComparisonFunc was
deserialized, POUserFunc got deserialized first and tried to instantiate an
EvalFunc from a ComparisonFunc spec. To resolve this, I had to make
POUserComparisonFunc independent of POUserFunc, and here I have made the
assumption that ComparisonFunc is used only in ORDER BY and not elsewhere.
This accounts for all the extraneous changes in the patch.
The next thing I did was to try to supply the missing user-defined comparator
to Hadoop as the key comparator. However, this causes issues:
We assume that ComparisonFunc always compares Tuples. However, with the
introduction of types, we do not always wrap everything into a tuple and
instead try to use the basic types wherever possible. The patch I am going to
submit does not address this part. It assumes that the issue with
ComparisonFunc will be fixed and directly sets the user-defined comparator as
the output key comparator. For the time being, this will cause all
user-defined comparisons to fail.
Some hints on the ComparisonFunc issue:
1. The solution should take into consideration that a ComparisonFunc is
sometimes generic and need not know the schema of the input, e.g. OrdDesc.
2. Often, however, if it is not a generic ComparisonFunc, we can assume that
the schema is known.
3. The ComparisonFunc will have to work with Hadoop types and not Pig types, as
it would be used at the boundary between LR and Pkg.
Currently, ComparisonFunc extends WritableComparator and gives a concrete
implementation that delegates all
compare(WritableComparable,WritableComparable) calls to compare(Tuple,Tuple).
If we instead leave compare(WritableComparable,WritableComparable) abstract,
I feel it should solve the problem: users can provide an implementation of
compare for the type that they are expecting. I will attach a patch shortly.
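A rough Java sketch of the shape of this proposal, using plain java.util types
so that it runs without Hadoop or Pig on the classpath. All three class names
are hypothetical stand-ins: KeyComparisonSketch plays the role of a
ComparisonFunc whose key-level compare is left abstract, and Object stands in
for WritableComparable. It is not the patch itself.

```java
import java.util.Comparator;

// Hypothetical stand-in for a ComparisonFunc with the key-level compare left
// abstract, so nothing forces the keys to be tuples.
abstract class KeyComparisonSketch implements Comparator<Object> {
    @Override
    public abstract int compare(Object left, Object right);
}

// Generic comparator in the spirit of OrdDesc: it needs no schema and only
// assumes that the keys are mutually comparable.
final class DescendingKeySketch extends KeyComparisonSketch {
    @Override
    @SuppressWarnings("unchecked")
    public int compare(Object left, Object right) {
        return ((Comparable<Object>) right).compareTo(left);
    }
}

// Schema-aware comparator: the user knows the key is an integer and casts
// to the type they are expecting.
final class IntKeySketch extends KeyComparisonSketch {
    @Override
    public int compare(Object left, Object right) {
        return Integer.compare((Integer) left, (Integer) right);
    }
}
```

Sorting {"alice", "zach", "mike"} with DescendingKeySketch gives zach, mike,
alice, which matches the ordering expected in the issue description below.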
> custom compare functions is ignored
> -----------------------------------
>
> Key: PIG-285
> URL: https://issues.apache.org/jira/browse/PIG-285
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Assignee: Shravan Matthur Narayanamurthy
>
> The following query successfully runs but the results don't come in the
> correct order:
> a = load 'studenttab10k';
> c = order a by $0 using org.apache.pig.test.udf.orderby.OrdDesc;
> store c into 'out';
> results:
> alice allen 27 1.950
> alice allen 42 2.460
> alice allen 38 0.810
> alice allen 68 3.390
> alice allen 77 2.520
> alice allen 36 2.270
> .....
> expected:
> zach zipper 66 2.670
> zach zipper 47 2.920
> zach zipper 19 1.910
> zach zipper 23 1.120
> zach zipper 40 2.030
> zach zipper 59 2.530
> .....