[ https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905230#comment-15905230 ]
Nandor Kollar commented on PIG-5167:
------------------------------------
[~knoguchi] AFAIK that's how it works right now: we retrieve the result from
HDFS, sort it, and compare the sorted files (benchmark and actual). I guess the
problem here is this: the test calls distinct and then limit 100. Since the
order is not guaranteed, the output of distinct is an arbitrary ordering of
tuples, and limit retains only the first 100 of them, so we don't know which
100 tuples we get. Sorting the result afterwards won't help, because the set of
tuples itself is going to be different. Please correct me if I'm wrong, but I
think we should order the tuples before calling limit, and this is probably an
issue in MR and Tez mode as well, isn't it?
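
To illustrate the point, here is a minimal Python sketch (not Pig code, and not
the actual test harness). The names distinct_then_limit, engine_a and engine_b
are hypothetical: the two "engines" just stand in for two execution modes that
happen to emit the distinct tuples in different orders. The data is made up.

```python
def engine_a(tuples):
    # Hypothetical engine that happens to emit tuples in ascending order.
    return sorted(tuples)

def engine_b(tuples):
    # Hypothetical engine that happens to emit tuples in descending order.
    return sorted(tuples, reverse=True)

def distinct_then_limit(rows, n, engine, order_first=False):
    """Model DISTINCT followed by LIMIT n, optionally ordering in between."""
    emitted = engine(set(rows))      # distinct, in an engine-specific order
    if order_first:
        emitted = sorted(emitted)    # explicit ordering before the limit
    return emitted[:n]               # limit keeps whatever comes first

# Made-up input: 18 distinct (name, age) tuples, with duplicates.
rows = [("alice", 20 + i % 7) for i in range(50)] + \
       [("bob", 20 + i % 11) for i in range(50)]

# Without ordering: sorting the two outputs afterwards does not make them equal,
# because each engine kept a different set of tuples.
a = distinct_then_limit(rows, 5, engine_a)
b = distinct_then_limit(rows, 5, engine_b)
print(sorted(a) == sorted(b))   # False

# With ordering before the limit: both engines keep the same tuples.
a_ordered = distinct_then_limit(rows, 5, engine_a, order_first=True)
b_ordered = distinct_then_limit(rows, 5, engine_b, order_first=True)
print(a_ordered == b_ordered)   # True
```

The sketch assumes the ordering is total over the tuples; if the sort key has
ties, the tuples kept at the limit boundary could still differ between engines.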
> Limit_4 is failing with spark exec type
> ---------------------------------------
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Nandor Kollar
> Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50 3.00
> < 74 2.22
> < alice carson 66 2.42
> < alice quirinius 71 0.03
> < alice van buren 28 2.50
> ---
> > bob allen 0.28
> > bob allen 22 0.92
> > bob allen 25 2.54
> > bob allen 26 2.35
> > bob allen 27 2.17
> {code}