[
https://issues.apache.org/jira/browse/DRILL-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709463#comment-14709463
]
Deneche A. Hakim commented on DRILL-3689:
-----------------------------------------
Looking at the output of the inner query:
{noformat}
select col7, col1, lead(col1) over(partition by col7 order by col1) leadcol1
from `3648.parquet` order by col7, col1 limit 20;
+--------+----------------------+----------------------+
| col7 | col1 | leadcol1 |
+--------+----------------------+----------------------+
| false | -1 | 1 |
| false | 1 | 17 |
| false | 17 | 30 |
| false | 30 | 200 |
| false | 200 | 1000 |
| false | 1000 | 1001 |
| false | 1001 | 5000 |
| false | 5000 | 65534 |
| false | 65534 | 4611686018427387903 |
| false | 4611686018427387903 | 9223372036854775807 |
| false | 9223372036854775807 | null |
| true | -65535 | 0 |
| true | 0 | 13 |
| true | 13 | 23 |
| true | 23 | 25 |
| true | 25 | 197 |
| true | 197 | 3000 |
| true | 3000 | 9999999 |
| true | 9999999 | 10000000 |
| true | 10000000 | 92233720385475807 |
+--------+----------------------+----------------------+
{noformat}
Because col1 is stored as an INT64, trying to compute the average of
{{lead(col1)}} will cause an overflow. This explains why we have different
results than Postgres.
[~khfaraaz] Can you confirm this by running the same query on a column with
smaller numbers ?
> incorrect results : aggregate AVG returns wrong results over results returned
> by LEAD function.
> -----------------------------------------------------------------------------------------------
>
> Key: DRILL-3689
> URL: https://issues.apache.org/jira/browse/DRILL-3689
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 1.2.0
> Environment: private-branch
> https://github.com/adeneche/incubator-drill/tree/new-window-funcs
> Reporter: Khurram Faraaz
> Assignee: Deneche A. Hakim
> Priority: Critical
> Labels: window_function
> Fix For: 1.2.0
>
>
> Aggregate AVG returns wrong results over results returned by LEAD function.
> results returned by Drill
> {code}
> 0: jdbc:drill:schema=dfs.tmp> SELECT avg(lead_col1) FROM (SELECT LEAD(col1)
> OVER(PARTITION BY col7 ORDER BY col1) lead_col1 , col7 FROM FEWRWSPQQ_101)
> sub_query;
> +-------------------------+
> | EXPR$0 |
> +-------------------------+
> | 2.35195986941647008E17 |
> +-------------------------+
> 1 row selected (0.264 seconds)
> {code}
> Explain plan for above query from Drill
> {code}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for SELECT avg(lead_col1) FROM
> (SELECT LEAD(col1) OVER(PARTITION BY col7 ORDER BY col1) lead_col1 , col7
> FROM FEWRWSPQQ_101) sub_query;
> +------+------+
> | text | json |
> +------+------+
> | 00-00 Screen
> 00-01 Project(EXPR$0=[$0])
> 00-02 Project(EXPR$0=[CAST(/(CastHigh(CASE(=($1, 0), null, $0)),
> $1)):ANY NOT NULL])
> 00-03 StreamAgg(group=[{}], agg#0=[$SUM0($0)], agg#1=[COUNT($0)])
> 00-04 Project(w0$o0=[$3])
> 00-05 Window(window#0=[window(partition {2} order by [1] range
> between UNBOUNDED PRECEDING and CURRENT ROW aggs [LEAD($1)])])
> 00-06 SelectionVectorRemover
> 00-07 Sort(sort0=[$2], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-08 Project(T36¦¦*=[$0], col1=[$1], col7=[$2])
> 00-09 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=maprfs:///tmp/FEWRWSPQQ_101]],
> selectionRoot=maprfs:/tmp/FEWRWSPQQ_101, numFiles=1, columns=[`*`]]])
> {code}
> results returned by Postgres
> {code}
> postgres=# SELECT avg(lead_col1) FROM (SELECT LEAD(col1) OVER(PARTITION BY
> col7 ORDER BY col1) lead_col1 , col7 FROM FEWRWSPQQ_101) sub_query;
> avg
> ---------------------
> 1157533190627124568
> (1 row)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)