Problems with some top N queries
--------------------------------
Key: PIG-1169
URL: https://issues.apache.org/jira/browse/PIG-1169
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Richard Ding
Recently, users reported a couple of problems with Top N queries.
* From Chuang Liu:
We tried to get the top N results after a group-by and sort, and got different
results depending on whether the full sorted results were also stored. Here is
a skeleton of our Pig script.
{code}
raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
grouped = group raw_data by (f1, f2);
data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
ordered = order data by value DESC parallel 10;
topn = limit ordered 10;
store ordered into 'outputdir/full';
store topn into 'outputdir/topn';
{code}
With the statement 'store ordered ...', top N results are incorrect, but
without the statement, results are correct. Has anyone seen this before? I know
a similar bug has been fixed in the multi-query release. We are on Pig 0.4 and
Hadoop 0.20.1.
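The expectation behind this report can be stated as an invariant: the top N output should equal the first N rows of the fully sorted output, whether or not the sorted relation is also stored. The following is a minimal Python sketch of that invariant only. The rows, the value of N, and the helper name `top_n` are hypothetical, and it does not model how Pig executes the script.

```python
# Hypothetical (f1, f2, value) rows standing in for the 'data' relation.
data = [("a", "x", 5), ("b", "y", 42), ("c", "z", 17), ("d", "w", 8)]


def order_desc(rows):
    """Model of: ordered = order data by value DESC."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def top_n(rows, n):
    """Model of: topn = limit ordered n."""
    return order_desc(rows)[:n]


ordered = order_desc(data)
topn = top_n(data, 2)  # the script uses 10; 2 keeps the example small

# Storing 'ordered' as well must not change what 'topn' contains.
assert topn == ordered[:2]
print(topn)
```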
* From Corry Haines:
I am not sure if this is a bug, or something more subtle, but here is the
problem that I am having.
When I LOAD a dataset, change it with an ORDER, LIMIT it, then CROSS it with
itself, the results are not correct. I expect to see the cross of the limited,
ordered dataset, but instead I see the cross of the limited dataset.
Effectively, it's like the LIMIT is being excluded.
Pig Version: 0.5.0
Hadoop Version: 0.20.1
I would greatly appreciate some help, as this is somewhat frustrating.
Example code (and output) follows:
{code}
A = load 'foo' as (f1:int, f2:int, f3:int);
B = load 'foo' as (f1:int, f2:int, f3:int);
a = ORDER A BY f1 DESC;
b = ORDER B BY f1 DESC;
aa = LIMIT a 1;
bb = LIMIT b 1;
C = CROSS aa, bb;
DUMP C;
{code}
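To make the expected result concrete, here is a small Python sketch of what "the cross of the limited, ordered dataset" would contain. The contents of 'foo' are not given in the report, so the three rows below are hypothetical, as is the helper name `order_limit`. This models the intended semantics only, not Pig's execution plan.

```python
from itertools import product

# Hypothetical contents of 'foo' as (f1, f2, f3) tuples.
rows = [(1, 10, 100), (3, 30, 300), (2, 20, 200)]


def order_limit(rows, n):
    """Model of: ORDER ... BY f1 DESC followed by LIMIT n."""
    return sorted(rows, key=lambda r: r[0], reverse=True)[:n]


aa = order_limit(rows, 1)
bb = order_limit(rows, 1)

# Model of: C = CROSS aa, bb (each pair of tuples is concatenated).
C = [a + b for a, b in product(aa, bb)]
print(C)  # [(3, 30, 300, 3, 30, 300)]
```

With these rows, the expected output is a single tuple built from the row with the largest f1 on both sides.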