[ 
https://issues.apache.org/jira/browse/DRILL-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158251#comment-14158251
 ] 

Jinfeng Ni commented on DRILL-1487:
-----------------------------------

Looks like the incorrect results issue is caused by missing the following rows 
that are equal to the current row w.r.t the ORDER BY CLAUSE in the window 
frame. Here is the explanation of the concept of window frame in posgresql: 
 
"There is another important concept associated with window functions: for each 
row, there is a set of rows within its partition called its window frame. Many 
(but not all) window functions act only on the rows of the window frame, rather 
than of the whole partition. By default, if ORDER BY is supplied then the frame 
consists of all rows from the start of the partition up through the current 
row, plus any following rows that are equal to the current row according to the 
ORDER BY clause. When ORDER BY is omitted the default frame consists of all 
rows in the partition."

If there is order by clause in the window definition, then, the window 
aggregate function would be applied to the windows frame, which consists the 
row fro the start of the partition up through the current row, plus any 
following rows that are equal to the current row w.r.t ORDER BY clause.  In the 
original query, since ORDER BY is same as the PARTITION column, the window 
aggregate function should be applied to all the rows in the partition, while 
the current implementation seems to miss the rows of "that are equal to the 
current row according to the ORDER BY clause.".

If re-modify the original query into the following query, we can clearly see 
that avg_sal is computed over the start row to the current row, but ignore the 
following rows that are equal to current row accord to the ORDER BY clause.  

{code}

SELECT employee_id,position_id, salary, avg(salary) OVER (PARTITION BY 
position_id order by position_id) as avg_sal FROM cp.`employee.json` order by 
position_id limit 50;
+-------------+-------------+------------+------------+
| employee_id | position_id |   salary   |  avg_sal   |
+-------------+-------------+------------+------------+
| 1           | 1           | 80000.0    | 80000.0    |
| 5           | 2           | 35000.0    | 35000.0    |
| 2           | 2           | 40000.0    | 37500.0    |
| 4           | 2           | 40000.0    | 38333.333333333336 |
| 20          | 2           | 30000.0    | 36250.0    |
| 21          | 2           | 35000.0    | 36000.0    |
| 22          | 2           | 35000.0    | 35833.333333333336 |
| 6           | 3           | 25000.0    | 25000.0    |
| 7           | 4           | 15000.0    | 15000.0    |
| 10          | 5           | 50000.0    | 50000.0    |
| 40          | 6           | 10000.0    | 10000.0    |
| 39          | 6           | 10000.0    | 10000.0    |
| 37          | 6           | 6700.0     | 8900.0     |
| 38          | 6           | 8000.0     | 8675.0     |
| 42          | 7           | 5000.0     | 5000.0     |
| 36          | 7           | 45000.0    | 25000.0    |
| 41          | 7           | 8500.0     | 19500.0    |
| 43          | 8           | 6700.0     | 6700.0     |
| 44          | 8           | 5000.0     | 5850.0     |
| 47          | 9           | 6500.0     | 6500.0     |
| 48          | 9           | 7200.0     | 6850.0     |
| 49          | 9           | 5000.0     | 6233.333333333333 |
| 50          | 9           | 5000.0     | 5925.0     |
| 51          | 9           | 5000.0     | 5740.0     |
| 52          | 9           | 5000.0     | 5616.666666666667 |
| 45          | 9           | 6800.0     | 5785.714285714285 |
| 46          | 9           | 6600.0     | 5887.5     |

{code}

The fix seems to be to include the rows with equal ORDER BY columns in the 
window frame, when compute window aggregate function.


> Drill window functions return wrong results
> -------------------------------------------
>
>                 Key: DRILL-1487
>                 URL: https://issues.apache.org/jira/browse/DRILL-1487
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Neeraja
>
> Executing the following window function with the requirement to see how a 
> given employee salary would compare to the avg(salary) in his/her position. 
> Query executes fine however returns wrong results(expect the avg(salary) to 
> stay same for a given window (i.e position id)
> 0: jdbc:drill:zk=local> SELECT employee_id,position_id, salary, avg(salary) 
> OVER (PARTITION BY position_id order by position_id) FROM cp.`employee.json` 
> order by employee_id;
> +-------------+-------------+------------+------------+
> | employee_id | position_id |   salary   |   EXPR$3   |
> +-------------+-------------+------------+------------+
> | 1           | 1           | 80000.0    | 80000.0    |
> | 2           | 2           | 40000.0    | 37500.0    |
> | 4           | 2           | 40000.0    | 38333.333333333336 |
> | 5           | 2           | 35000.0    | 35000.0    |
> | 6           | 3           | 25000.0    | 25000.0    |
> | 7           | 4           | 15000.0    | 15000.0    |
> | 8           | 11          | 10000.0    | 14333.333333333334 |
> | 9           | 11          | 17000.0    | 17000.0    |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to