[
https://issues.apache.org/jira/browse/DRILL-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606373#comment-14606373
]
Jinfeng Ni commented on DRILL-3414:
-----------------------------------
My preliminary analysis seems to indicate that Drill produce correct result,
while Postgre does not.
The difference between q2.res (postgres output) and q2.out(drill output) start
at line 1945 - 1947, and the rest of remaining output.
Line 1935 - 1947
Drill:
{code}
true 1960-01-05 488139464 1464418392
true 1960-01-05 488139464 1464418392
true 1960-01-05 488139464 1464418392
true 1960-01-15 471050029 2877568479
true 1960-01-15 471050029 2877568479
true 1960-01-15 471050029 2877568479
true 1960-01-22 459201109 3795970697
true 1960-01-22 459201109 3795970697
true 1960-02-05 439657915 4675286527
true 1960-02-05 439657915 4675286527
true 1960-02-17 429666606 5964286345
true 1960-02-17 429666606 5964286345
true 1960-02-17 429666606 5964286345
{code}
Postgres:
{code}
true 1960-01-05 488139464 1464418392
true 1960-01-05 488139464 1464418392
true 1960-01-05 488139464 1464418392
true 1960-01-15 471050029 2877568479
true 1960-01-15 471050029 2877568479
true 1960-01-15 471050029 2877568479
true 1960-01-22 459201109 3795970697
true 1960-01-22 459201109 3795970697
true 1960-02-05 439657915 4675286527
true 1960-02-05 439657915 4675286527
null 1960-02-17 429666606 1288999818
null 1960-02-17 429666606 1288999818
null 1960-02-17 429666606 1288999818
{code}
We can see for c_integer = 429666606, Drill has c_boolean = true, and sum() as
5964286345. On postgres, for c_integer = 429666606, c_boolean becomes null,
and sum() becomes 1288999818.
Now, let's look at the data:
{code}
select c_boolean, c_date, c_integer from j4 where c_integer = 429666606;
+------------+-------------+------------+
| c_boolean | c_date | c_integer |
+------------+-------------+------------+
| true | 1960-02-17 | 429666606 |
+------------+-------------+------------+
select c_integer from j1 where c_integer= 429666606;
+------------+
| c_integer |
+------------+
| 429666606 |
| 429666606 |
| 429666606 |
+------------+=
{code}
Apparently, both table j1 and j4 has c_integer = 429666606, and the left join
does not produce null value from right side. Therefore, c_boolean should be
'true', and since window function use c_boolean as partition by column, those 3
rows from the join operator should belong to the group of 'true'. That means
sum() for those three rows is 4675286527 + 429666606 * 3 = 5964286345.
In contrast, Postgres seems to put c_boolean as null, and treat the 3 rows as a
different group, which seems to be not right.
> Window function on a null producing column of an outer join results in the
> wrong result
> ---------------------------------------------------------------------------------------
>
> Key: DRILL-3414
> URL: https://issues.apache.org/jira/browse/DRILL-3414
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.0.0
> Reporter: Victoria Markman
> Assignee: Jinfeng Ni
> Priority: Critical
> Labels: window_funcion
> Attachments: j1.tar, j4.parquet, query.tar
>
>
> {code:sql}
> select
> j4.c_boolean,
> j4.c_date,
> j4.c_integer,
> sum(j4.c_integer) over (partition by j4.c_boolean order by j4.c_date,
> j4.c_integer)
> from j1
> left outer join
> j4 on j1.c_integer = j4.c_integer
> order by 1,2,3;
> {code}
> If window function is on left side, query returns correct result.
> This works:
> {code:sql}
> select
> j1.c_boolean,
> j1.c_date,
> sum(j1.c_integer) over (partition by j1.c_boolean order by j1.c_date)
> from
> j1
> left outer join
> j4 on j1.c_integer = j4.c_integer
> order by
> 1, 2;
> {code}
> Attached:
> 1. query.tar (q2.sql , q2.res (postgres output), q2.out (drill output) )
> 2. tables : j1.tar, j4.parquet
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)