[
https://issues.apache.org/jira/browse/DRILL-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697432#comment-14697432
]
Khurram Faraaz commented on DRILL-3648:
---------------------------------------
Here is the repro with a smaller dataset.
Results returned by Drill.
{code}
0: jdbc:drill:schema=dfs.tmp> select col7 , ntile(5) over(partition by col7
order by col0) NTILE_col7 from FEWRWSPQQ_101;
+--------+-------------+
| col7 | NTILE_col7 |
+--------+-------------+
| false | 1 |
| false | 2 |
| false | 3 |
| false | 4 |
| false | 5 |
| false | 5 |
| false | 5 |
| false | 5 |
| false | 5 |
| false | 5 |
| false | 5 |
| true | 1 |
| true | 2 |
| true | 3 |
| true | 4 |
| true | 5 |
| true | 5 |
| true | 5 |
| true | 5 |
| true | 5 |
| true | 5 |
| true | 5 |
+--------+-------------+
22 rows selected (0.248 seconds)
{code}
Results returned by Postgres
{code}
postgres=# select col7 , ntile(5) over(partition by col7 order by col0)
NTILE_col7 from FEWRWSPQQ_101;
col7 | ntile_col7
------+------------
f | 1
f | 1
f | 1
f | 2
f | 2
f | 3
f | 3
f | 4
f | 4
f | 5
f | 5
t | 1
t | 1
t | 1
t | 2
t | 2
t | 3
t | 3
t | 4
t | 4
t | 5
t | 5
(22 rows)
{code}
Interesting observation is that, when I use order by 1 in the window definition
we see correct results.
{code}
0: jdbc:drill:schema=dfs.tmp> select col7 , ntile(5) over(partition by col7
order by 1) NTILE_col7 from FEWRWSPQQ_101;
+--------+-------------+
| col7 | NTILE_col7 |
+--------+-------------+
| false | 1 |
| false | 1 |
| false | 1 |
| false | 2 |
| false | 2 |
| false | 3 |
| false | 3 |
| false | 4 |
| false | 4 |
| false | 5 |
| false | 5 |
| true | 1 |
| true | 1 |
| true | 1 |
| true | 2 |
| true | 2 |
| true | 3 |
| true | 3 |
| true | 4 |
| true | 4 |
| true | 5 |
| true | 5 |
+--------+-------------+
22 rows selected (0.277 seconds)
{code}
> NTILE function returns incorrect results
> ----------------------------------------
>
> Key: DRILL-3648
> URL: https://issues.apache.org/jira/browse/DRILL-3648
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.2.0
> Environment: private-branch
> https://github.com/adeneche/incubator-drill/tree/new-window-funcs
> Reporter: Khurram Faraaz
> Assignee: Deneche A. Hakim
> Priority: Critical
> Labels: window_function
> Fix For: 1.2.0
>
>
> NTILE function returns incorrect results for larger dataset. I am working on
> reproducing the problem with a smaller dataset.
> The inner query that uses NTILE should have divided the rows into two sets
> (tiles) where each tile consists of (937088 + 1 ) rows , 937088 rows
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select ntile_key2, count(ntile_key2) from
> (select ntile(2) over(partition by key2 order by key1) ntile_key2 from
> `twoKeyJsn.json` where key2 = 'm') group by ntile_key2;
> +-------------+----------+
> | ntile_key2 | EXPR$1 |
> +-------------+----------+
> | 1 | 1 |
> | 2 | 1874176 |
> +-------------+----------+
> 2 rows selected (49.406 seconds)
> {code}
> Explain plan for inner query that returns wrong results.
> {code}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select ntile(2) over(partition
> by key2 order by key1) from `twoKeyJsn.json` where key2 = 'm';
> +------+------+
> | text | json |
> +------+------+
> | 00-00 Screen
> 00-01 UnionExchange
> 01-01 Project(EXPR$0=[$0])
> 01-02 Project($0=[$2])
> 01-03 Window(window#0=[window(partition {0} order by [1] range
> between UNBOUNDED PRECEDING and CURRENT ROW aggs [NTILE($2)])])
> 01-04 SelectionVectorRemover
> 01-05 Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 01-06 Project(key2=[$0], key1=[$1])
> 01-07 HashToRandomExchange(dist0=[[$0]])
> 02-01 UnorderedMuxExchange
> 03-01 Project(key2=[$0], key1=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02 SelectionVectorRemover
> 03-03 Filter(condition=[=($0, 'm')])
> 03-04 Scan(groupscan=[EasyGroupScan
> [selectionRoot=maprfs:/tmp/twoKeyJsn.json, numFiles=1, columns=[`key2`,
> `key1`], files=[maprfs:///tmp/twoKeyJsn.json]]])
> {code}
> Total number of rows in partition that has key2 = 'm'
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(key1) from `twoKeyJsn.json` where
> key2 = 'm';
> +----------+
> | EXPR$0 |
> +----------+
> | 1874177 |
> +----------+
> 1 row selected (37.581 seconds)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)