[ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905230#comment-15905230
 ] 

Nandor Kollar commented on PIG-5167:
------------------------------------

[~knoguchi] AFAIK that's how it works right now: retrieve the result from HDFS, 
sort it, and compare the sorted files (benchmark and actual). I guess the 
problem here is that the test calls distinct and then limit 100. Since distinct 
does not guarantee any order, its output is an arbitrary ordering of tuples, 
and limit keeps only the first 100 of them, so we don't know which 100 tuples 
we get. Sorting the result afterwards won't help, because the set of tuples 
itself can differ from the benchmark. Please correct me if I'm wrong, but I 
think we should order the tuples before calling limit. This is probably an 
issue in MR and Tez mode too, isn't it?
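
To illustrate the point, here is a minimal sketch in plain Python (not Pig 
code). The tuples, the `limit` helper and the two "engine" orderings are made 
up for illustration and are not taken from the real Limit_4 test data. It only 
shows why sorting after an unordered limit cannot make the outputs comparable, 
while ordering before the limit can.

```python
def limit(rows, n):
    """Keep the first n rows in whatever order they arrive (like LIMIT)."""
    return rows[:n]

# Hypothetical distinct tuples (name, age, gpa); not the real test data.
rows = [
    ("bob allen", 22, 0.92),
    ("alice carson", 66, 2.42),
    ("bob allen", 25, 2.54),
    ("alice quirinius", 71, 0.03),
]

# Assumption: two exec types emit the same distinct set in different orders.
engine_a = list(rows)
engine_b = list(reversed(rows))

# LIMIT without a preceding order, then sort the output as the harness does.
unordered_a = sorted(limit(engine_a, 2))
unordered_b = sorted(limit(engine_b, 2))

# Order before LIMIT: both engines keep the same tuples.
ordered_a = limit(sorted(engine_a), 2)
ordered_b = limit(sorted(engine_b), 2)

print(unordered_a == unordered_b)  # False: different tuples survived the limit
print(ordered_a == ordered_b)      # True: the same first two tuples are kept
```

In the unordered case each engine keeps a different pair of tuples, so the 
sorted files still differ. Once the input is ordered first, the limit picks 
the same tuples regardless of how the engine happened to emit them.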

> Limit_4 is failing with spark exec type
> ---------------------------------------
>
>                 Key: PIG-5167
>                 URL: https://issues.apache.org/jira/browse/PIG-5167
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
>         Attachments: PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> <     50      3.00
> <     74      2.22
> < alice carson        66      2.42
> < alice quirinius     71      0.03
> < alice van buren     28      2.50
> ---
> > bob allen           0.28
> > bob allen   22      0.92
> > bob allen   25      2.54
> > bob allen   26      2.35
> > bob allen   27      2.17
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
