[ 
https://issues.apache.org/jira/browse/PIG-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611759#action_12611759
 ] 

Shravan Matthur Narayanamurthy commented on PIG-285:
----------------------------------------------------

I have found the issue. Before describing it let me give some background so 
that fixing other related issues is simpler:

The ORDER BY clause is handled with quantiles. So a job which has an order by 
occurring in the main plan is run as multiple jobs:
1. Store the output up to the order by.
2. Run a quantile job to find the quantiles.
3. Run the sort job.

The following should be the pig-script version of getQuantileJob in MRCompiler:
{noformat}
A = load fSpec using RandomSampleLoader;
B = foreach A generate flatten(col1), flatten(col2), ...;
C = group B all;
D = foreach C {
        D1 = order $1 by *;
        generate requestedParallelism,D1;
}
E = foreach D generate FindQuantiles(*);
store E into quantFiles;
{noformat}

The getSortJob should look something like this:
{noformat}
A = load fSpec using BinStorage;
B = group A by (col1,col2,...);
C = foreach B generate flatten(A);
{noformat}

C should have the output of ORDER BY.

Also, the sort job should have some key things turned on in Hadoop for it to 
work:
1. Use the SortPartitioner as the key partitioner, which internally uses the 
quantile file generated by the quantile job.
2. Supply any user-defined comparator to Hadoop as the output key 
comparator.
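
To illustrate how a partitioner can use quantiles to get a total order across 
reducers, here is a minimal Java sketch. It is not Pig's SortPartitioner: the 
names QuantilePartitionSketch, findQuantiles and partitionFor are hypothetical, 
keys are plain ints, and the cut points are held in an array instead of being 
read from a quantile file.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Pig's SortPartitioner or FindQuantiles.
final class QuantilePartitionSketch {

    // Pick numPartitions - 1 cut points from a sample of the keys.
    // The sample is sorted first so that the cut points are non-decreasing.
    static int[] findQuantiles(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return cuts;
    }

    // Keys below cuts[0] go to partition 0; a key equal to or above a cut
    // point goes to a later partition. Every key in partition p is therefore
    // <= every key in partition p + 1.
    static int partitionFor(int key, int[] cuts) {
        int pos = Arrays.binarySearch(cuts, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] sample = {50, 10, 40, 20, 30, 60};
        int[] cuts = findQuantiles(sample, 3);
        System.out.println("cut points: " + Arrays.toString(cuts));
        for (int key : new int[] {5, 30, 45, 50, 99}) {
            System.out.println(key + " -> partition " + partitionFor(key, cuts));
        }
    }
}
```

If each partition is then sorted locally with the same ordering, concatenating 
the partitions in index order gives a globally sorted result, which is the 
property the sort job relies on.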

That is the ideal thing to do. The issue was the following:
Since the quantile job physical plan was hand crafted, it had the plan for the 
following instead of what it should have been:
{noformat}
A = load fSpec using RandomSampleLoader;
B = foreach A generate flatten(col1), flatten(col2), ...;
C = group B all;
D = foreach C {
        generate requestedParallelism,$1;
}
E = foreach D generate FindQuantiles(*);
store E into quantFiles;
{noformat}

Hence, instead of the sorted output, all we saw was the grouped output. It 
probably also produced some incorrect results, as the quantiles might have been 
computed wrongly if a parallel statement was used along with the order by. That 
is the cause of 
[Pig-292|https://issues.apache.org/jira/browse/PIG-292].
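
One way a missing sort could corrupt the quantiles is sketched below in Java. 
It assumes that cut points are read at evenly spaced positions of the sample, 
which is an assumption made for illustration and not a description of how 
FindQuantiles actually works; UnsortedSampleSketch, cutPoints and 
isNonDecreasing are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the behaviour of Pig's FindQuantiles.
final class UnsortedSampleSketch {

    // Read numPartitions - 1 cut points at evenly spaced positions,
    // optionally sorting the sample first.
    static int[] cutPoints(int[] sample, int numPartitions, boolean sortFirst) {
        int[] values = sample.clone();
        if (sortFirst) {
            Arrays.sort(values);
        }
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = values[i * values.length / numPartitions];
        }
        return cuts;
    }

    // Partition boundaries only make sense if they never decrease.
    static boolean isNonDecreasing(int[] values) {
        for (int i = 1; i < values.length; i++) {
            if (values[i - 1] > values[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] sample = {50, 10, 40, 20, 30, 60};
        int[] unsorted = cutPoints(sample, 3, false);
        int[] sorted = cutPoints(sample, 3, true);
        System.out.println("without sort: " + Arrays.toString(unsorted)
                + " usable=" + isNonDecreasing(unsorted));
        System.out.println("with sort:    " + Arrays.toString(sorted)
                + " usable=" + isNonDecreasing(sorted));
    }
}
```

With a single reducer there are no cut points at all, which fits the 
observation that the problem shows up only when a parallel statement is used.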

The other part was that the user defined comparator was not being passed as the 
output key comparator which is the cause of the current bug. Another thing that 
led to us not finding the bug early was an error in the testSort test case 
which I have corrected in 
[Pig-295|https://issues.apache.org/jira/browse/PIG-295].

To resolve this issue, I first corrected the quantile job to include the order 
by in the nested plan. However, this caused issues with deserializing 
POUserComparisonFunc, which extended POUserFunc. When POUserComparisonFunc was 
deserialized, POUserFunc got deserialized first and tried to instantiate an 
EvalFunc from a ComparisonFunc spec. To resolve this, I had to make 
POUserComparisonFunc independent of POUserFunc, and here I have made the 
assumption that ComparisonFunc is used only in ORDER BY and not elsewhere. 
This accounts for all the seemingly extraneous changes in the patch.
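
The deserialization problem can be reproduced in miniature with plain Java 
serialization. The sketch below is not Pig code: UserFuncOp, 
UserComparisonFuncOp, EvalFuncStandIn and ComparisonFuncStandIn are 
hypothetical stand-ins, and it assumes the parent rebuilds its function from a 
class-name spec while being read back, which is my reading of the failure 
described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-ins throughout; none of these are Pig classes.
final class DeserializationOrderSketch {
    interface EvalFuncStandIn { String exec(String input); }
    interface ComparisonFuncStandIn { int compare(String a, String b); }

    static class UpperCase implements EvalFuncStandIn {
        public String exec(String input) { return input.toUpperCase(); }
    }

    static class Descending implements ComparisonFuncStandIn {
        public int compare(String a, String b) { return b.compareTo(a); }
    }

    // Parent operator: rebuilds an eval function from its spec when read back.
    static class UserFuncOp implements Serializable {
        private static final long serialVersionUID = 1L;
        protected final String funcSpec;
        protected transient EvalFuncStandIn func;

        UserFuncOp(String funcSpec) { this.funcSpec = funcSpec; }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            try {
                Object instance =
                        Class.forName(funcSpec).getDeclaredConstructor().newInstance();
                func = (EvalFuncStandIn) instance; // fails for a comparison spec
            } catch (ReflectiveOperationException e) {
                throw new IOException(e);
            }
        }
    }

    // Child operator: carries a comparison spec but inherits the parent's read logic.
    static class UserComparisonFuncOp extends UserFuncOp {
        private static final long serialVersionUID = 1L;
        UserComparisonFuncOp(String funcSpec) { super(funcSpec); }
    }

    // Serializes and deserializes op; returns true if the read fails with a
    // ClassCastException raised by the parent's readObject.
    static boolean roundTripFails(Serializable op) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(op);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                in.readObject();
                return false;
            } catch (ClassCastException e) {
                return true;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("eval op fails: "
                + roundTripFails(new UserFuncOp(UpperCase.class.getName())));
        System.out.println("comparison op fails: "
                + roundTripFails(new UserComparisonFuncOp(Descending.class.getName())));
    }
}
```

Because the parent's read logic always runs for a subclass instance, the 
simplest way out in this sketch is the one taken in the patch: stop inheriting 
from the parent so that its read logic is never applied to a comparison spec.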

The next thing I did was to try to correct the missing supply of the 
user-defined comparator to Hadoop as the key comparator. However, this causes 
issues: we assume that ComparisonFunc always compares Tuples, but with the 
inclusion of types, we do not always wrap everything into a tuple and instead 
try to use the basic types wherever possible. The patch I am going to submit 
does not address this part. The patch will assume that the issue with 
ComparisonFunc will be fixed and directly sets the user-defined comparator as 
the output key comparator. This will, for the time being, cause all 
user-defined comparisons to fail.

Some hints on the ComparisonFunc issue:
1. The solution should take into consideration that sometimes a ComparisonFunc 
is generic and need not know the schema of the input, e.g. OrdDesc.
2. Often, however, if it is not a generic ComparisonFunc, we can assume 
that the schema is known.
3. The ComparisonFunc will have to work with Hadoop types and not Pig types, as 
it would be used at the boundary between LR & Pkg.

Currently, ComparisonFunc extends WritableComparator and gives a concrete 
implementation that delegates all 
compare(WritableComparable,WritableComparable) calls to compare(Tuple,Tuple). 
If we instead leave compare(WritableComparable,WritableComparable) abstract, 
I feel it should solve the problem, and users can provide an implementation of 
compare for the type that they are expecting. I will attach a patch shortly.
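
The shape of that proposal can be sketched in plain Java without any Hadoop or 
Pig dependency. Everything here is a hypothetical stand-in: 
ComparisonFuncSketch plays the role of a ComparisonFunc whose raw compare is 
left abstract, Object stands in for WritableComparable, and OrdDescSketch is a 
generic descending comparator in the spirit of OrdDesc, not the real test UDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; no Hadoop or Pig types are used.
final class RawKeyComparatorSketch {

    // The raw comparison is left abstract instead of being forced through a
    // tuple-only method, so each implementation decides which key type it handles.
    abstract static class ComparisonFuncSketch implements Comparator<Object> {
        @Override
        public abstract int compare(Object left, Object right);
    }

    // Generic comparator: needs no schema, only that the keys are mutually comparable.
    static final class OrdDescSketch extends ComparisonFuncSketch {
        @SuppressWarnings("unchecked")
        @Override
        public int compare(Object left, Object right) {
            return ((Comparable<Object>) right).compareTo(left);
        }
    }

    // Sorts a copy of the keys with the descending comparator.
    static List<String> sortDescending(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(new OrdDescSketch());
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("alice allen", "zach zipper", "bob brown");
        System.out.println(sortDescending(names));
    }
}
```

A comparator written this way works whether the key arrives as a basic type or 
wrapped in a tuple-like object, as long as the implementation handles the type 
it is given.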

> custom compare functions is ignored
> -----------------------------------
>
>                 Key: PIG-285
>                 URL: https://issues.apache.org/jira/browse/PIG-285
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Olga Natkovich
>            Assignee: Shravan Matthur Narayanamurthy
>
> The following query successfully runs but the results don't come in the 
> correct order:
> a = load 'studenttab10k';
> c = order a by $0 using org.apache.pig.test.udf.orderby.OrdDesc;
> store c into 'out';
> results:
> alice allen     27      1.950
> alice allen     42      2.460
> alice allen     38      0.810
> alice allen     68      3.390
> alice allen     77      2.520
> alice allen     36      2.270
> .....
> expected:
> zach zipper     66      2.670
> zach zipper     47      2.920
> zach zipper     19      1.910
> zach zipper     23      1.120
> zach zipper     40      2.030
> zach zipper     59      2.530
> .....

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
