[jira] [Commented] (DRILL-2953) Group By + Order By query results are not ordered.

Jinfeng Ni (JIRA) Mon, 11 May 2015 17:13:33 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538939#comment-14538939
 ]


Jinfeng Ni commented on DRILL-2953:
-----------------------------------

Skipping the middle CAST(b) means the output of the Project will not carry the 
sort order (a, cast(b), c), so the downstream operator has to insert a Sort 
enforcer to get the ordering it requires.
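
To make the reasoning above concrete, here is a minimal Python sketch of how 
sort order might be propagated through a projection. It is an illustration 
only, not Drill's planner code: the names Expr, surviving_sort_prefix and 
needs_sort are hypothetical, and the rule that only plain column references 
preserve order is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Expr:
    """A projected expression: an input column, optionally wrapped in a cast."""
    input_index: int
    cast_to: Optional[str] = None  # None means a plain column reference


def surviving_sort_prefix(input_sort: List[int], project: List[Expr]) -> List[int]:
    """Return the output columns still known to be sorted after the projection.

    Assumption for this sketch: only plain column references preserve order.
    The prefix stops at the first sort key that is not passed through
    unchanged; later keys must NOT be appended by skipping over it.
    """
    out: List[int] = []
    for key in input_sort:
        match = next(
            (i for i, e in enumerate(project)
             if e.input_index == key and e.cast_to is None),
            None,
        )
        if match is None:
            break  # e.g. cast(b): the ordering guarantee ends here
        out.append(match)
    return out


def needs_sort(required: List[int], provided: List[int]) -> bool:
    """A Sort enforcer is needed unless `required` is a prefix of `provided`."""
    return provided[:len(required)] != required


# Input sorted on (a, b, c); the projection emits a, cast(b), c.
project = [Expr(0), Expr(1, "INTEGER"), Expr(2)]
provided = surviving_sort_prefix([0, 1, 2], project)
print(provided, needs_sort([0, 1, 2], provided))
```

In this sketch the projection only guarantees ordering on the first output 
column, so a consumer that requires (a, cast(b), c) has to add its own sort.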

I tried the following two cases. Both plans look fine:

1) 
{code}
select cast(columns[0] as int) as nation_key
from dfs_test.`file:/Users/jni/work/incubator-drill/exec/java-exec/target/test-classes/store/text/data/nations.csv`
group by columns[0], columns[1], columns[2]
order by columns[0], cast(columns[1] as int), columns[2]
{code}

Plan:

{code}
00-00    Screen
00-01      Project(nation_key=[$0])
00-02        SelectionVectorRemover
00-03          Sort(sort0=[$1], sort1=[$2], sort2=[$3], dir0=[ASC], dir1=[ASC], dir2=[ASC])
00-04            Project(nation_key=[CAST($0):INTEGER], EXPR$1=[$0], EXPR$2=[CAST($1):INTEGER], EXPR$3=[$2])
00-05              StreamAgg(group=[{0, 1, 2}])
00-06                Sort(sort0=[$0], sort1=[$1], sort2=[$2], dir0=[ASC], dir1=[ASC], dir2=[ASC])
00-07                  Project($f0=[ITEM($0, 0)], $f1=[ITEM($0, 1)], $f2=[ITEM($0, 2)])
00-08                    Scan(groupscan=[EasyGroupScan [selectionRoot=/Users/jni/work/incubator-drill/exec/java-exec/target/test-classes/store/text/data/nations.csv, numFiles=1, columns=[`columns`[0], `columns`[1], `columns`[2]], files=[file:/Users/jni/work/incubator-drill/exec/java-exec/target/test-classes/store/text/data/nations.csv]]])
{code}

2) 
{code}
select cast(columns[0] as int) as nation_key
from dfs_test.`file:/Users/jni/work/incubator-drill/exec/java-exec/target/test-classes/store/text/data/nations.csv`
group by columns[0], columns[1], columns[2]
order by columns[0], columns[2]
{code}

Plan:
{code}
00-00    Screen
00-01      Project(nation_key=[$0])
00-02        SelectionVectorRemover
00-03          Sort(sort0=[$1], sort1=[$2], dir0=[ASC], dir1=[ASC])
00-04            Project(nation_key=[CAST($0):INTEGER], EXPR$1=[$0], EXPR$2=[$2])
00-05              StreamAgg(group=[{0, 1, 2}])
00-06                Sort(sort0=[$0], sort1=[$1], sort2=[$2], dir0=[ASC], dir1=[ASC], dir2=[ASC])
00-07                  Project($f0=[ITEM($0, 0)], $f1=[ITEM($0, 1)], $f2=[ITEM($0, 2)])
00-08                    Scan(groupscan=[EasyGroupScan [selectionRoot=/Users/jni/work/incubator-drill/exec/java-exec/target/test-classes/store/text/data/nations.csv, numFiles=1, columns=[`columns`[0], `columns`[1], `columns`[2]], files=[file:/Users/jni/work/incubator-drill/exec/java-exec/target/test-classes/store/text/data/nations.csv]]])
{code}

In both cases above, a Sort is inserted (00-03) to enforce the ordering 
specified by the ORDER BY clause.
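
As a side illustration of why an ordering on the raw text column says nothing 
about the numeric order of the cast values, the Python sketch below sorts the 
eleven values from the quoted report both ways. It is plain Python, not Drill 
code, and the variable names are made up for the example.

```python
# The eleven values from the quoted report, as text read from the CSV.
raw = ["2", "10", "50", "55", "57", "61", "67", "89", "100", "113", "119"]

# Sort as text first, then cast: the order follows string comparison.
sorted_as_text = [int(v) for v in sorted(raw)]

# Cast first, then sort: the order follows integer comparison.
sorted_as_int = sorted(int(v) for v in raw)

print(sorted_as_text)  # [10, 100, 113, 119, 2, 50, 55, 57, 61, 67, 89]
print(sorted_as_int)   # [2, 10, 50, 55, 57, 61, 67, 89, 100, 113, 119]
```

The first list matches the row order shown in the quoted report, which is 
consistent with the only sort in that plan running on the text column below 
the aggregation.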


> Group By + Order By query results are not ordered.
> --------------------------------------------------
>
>                 Key: DRILL-2953
>                 URL: https://issues.apache.org/jira/browse/DRILL-2953
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.9.0
>         Environment: 10833d2cae9f5312cf0e31f8c9f3f8a9dcdc0c45 | Commit 0.9.0 
> release version. | 03.05.2015 @ 14:56:56 EDT
>            Reporter: Khurram Faraaz
>            Assignee: Jinfeng Ni
>            Priority: Critical
>             Fix For: 1.0.0
>
>         Attachments: 
> 0001-DRILL-2953-Ensure-sort-would-be-enforced-when-a-cast.patch
>
>
> Group by + order by query does not return results in correct order. Sort is 
> performed before the aggregation is done, which should not be the case.
> Test was performed on 4 node cluster on CentOS.
> {code}
> 0: jdbc:drill:> select cast(columns[0] as int) c1 from `testWindow.csv` t2 
> where t2.columns[0] is not null group by columns[0] order by columns[0];
> +------------+
> |     c1     |
> +------------+
> | 10         |
> | 100        |
> | 113        |
> | 119        |
> | 2          |
> | 50         |
> | 55         |
> | 57         |
> | 61         |
> | 67         |
> | 89         |
> +------------+
> 11 rows selected (0.218 seconds)
> {code}
> Explain plan for the query that returns wrong results.
> {code}
> 0: jdbc:drill:> explain plan for select cast(columns[0] as int) c1 from 
> `testWindow.csv` t2 where t2.columns[0] is not null group by columns[0] order 
> by columns[0];
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(c1=[$0])
> 00-02        Project(c1=[CAST($0):INTEGER], EXPR$1=[$0])
> 00-03          StreamAgg(group=[{0}])
> 00-04            Sort(sort0=[$0], dir0=[ASC])
> 00-05              Filter(condition=[IS NOT NULL($0)])
> 00-06                Project(ITEM=[ITEM($0, 0)])
> 00-07                  Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/tmp/testWindow.csv, numFiles=1, columns=[`columns`[0]], 
> files=[maprfs:/tmp/testWindow.csv]]])
> {code} 
> Incorrect results, not in order.
> {code}
> 0: jdbc:drill:> select cast(columns[0] as int) from `testWindow.csv` t2 where 
> t2.columns[0] is not null group by columns[0] order by columns[0];
> +------------+
> |   EXPR$0   |
> +------------+
> | 10         |
> | 100        |
> | 113        |
> | 119        |
> | 2          |
> | 50         |
> | 55         |
> | 57         |
> | 61         |
> | 67         |
> | 89         |
> +------------+
> 11 rows selected (0.214 seconds)
> {code}



