[ https://issues.apache.org/jira/browse/DRILL-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366221#comment-16366221 ]

Hanumath Rao Maduri commented on DRILL-6159:
--------------------------------------------

Thank you for the input. I agree that in some databases (mostly single-instance 
databases), when a user issues this type of query, all they care about is some 
natural order stored in the database, for pagination. However, in the context of 
a distributed database like Drill there is no natural order; even after doing 
all the processing, we might produce duplicate results across query runs. Given 
this scenario, I was wondering whether processing the offset is useful at all.
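
For example (using the view from the plan below; the overlap described is 
illustrative), two consecutive "pages" of the same query are not guaranteed to 
be consistent across runs:

{code}
-- With no ORDER BY, Drill may return rows in any order, so these two
-- "pages" can overlap or miss rows when the queries are re-run.
select zz1, zz2, a11 from dfs.tmp.viewtmp limit 100 offset 0;
select zz1, zz2, a11 from dfs.tmp.viewtmp limit 100 offset 100;
{code}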

I also think this optimization could discourage users from issuing these kinds 
of queries, since they would get the same result no matter what offset is 
provided.

However, I opened this JIRA to reach a consensus on this issue. I am also fine 
if the consensus among drillers is not to fix it.

> No need to offset rows if order by is not specified in the query.
> -----------------------------------------------------------------
>
>                 Key: DRILL-6159
>                 URL: https://issues.apache.org/jira/browse/DRILL-6159
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.12.0
>            Reporter: Hanumath Rao Maduri
>            Assignee: Hanumath Rao Maduri
>            Priority: Major
>             Fix For: Future
>
>
> For queries that have an offset and a limit but no ORDER BY, there is no need 
> to add the offset to the limit when pushing the limit down.
> SQL does not guarantee any output order if no ORDER BY is specified in the 
> query. It is observed that for queries with an offset and a limit and no 
> ORDER BY, the current optimizer adds the offset to the limit and limits to 
> that many rows. Doing so prevents the query from exiting early.
> Here is an example query and its plan.
> {code}
> select zz1,zz2,a11 from dfs.tmp.viewtmp limit 100000 offset 10000000
> 00-00    Screen : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 1.01E7, cumulative cost = {1.06048844E8 rows, 5.54015404E8 cpu, 0.0 io, 1.56569100288E11 network, 4.64926176E7 memory}, id = 787
> 00-01      Project(zz1=[$0], zz2=[$1], a11=[$2]) : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 1.01E7, cumulative cost = {1.05038844E8 rows, 5.53005404E8 cpu, 0.0 io, 1.56569100288E11 network, 4.64926176E7 memory}, id = 786
> 00-02        SelectionVectorRemover : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 1.01E7, cumulative cost = {1.05038844E8 rows, 5.53005404E8 cpu, 0.0 io, 1.56569100288E11 network, 4.64926176E7 memory}, id = 785
> 00-03          Limit(offset=[10000000], fetch=[100000]) : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 1.01E7, cumulative cost = {9.4938844E7 rows, 5.42905404E8 cpu, 0.0 io, 1.56569100288E11 network, 4.64926176E7 memory}, id = 784
> 00-04            UnionExchange : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 1.01E7, cumulative cost = {8.4838844E7 rows, 5.02505404E8 cpu, 0.0 io, 1.56569100288E11 network, 4.64926176E7 memory}, id = 783
> 01-01              SelectionVectorRemover : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 1.01E7, cumulative cost = {7.4738844E7 rows, 4.21705404E8 cpu, 0.0 io, 3.2460300288E10 network, 4.64926176E7 memory}, id = 782
> 01-02                Limit(fetch=[10100000]) : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 1.01E7, cumulative cost = {6.4638844E7 rows, 4.11605404E8 cpu, 0.0 io, 3.2460300288E10 network, 4.64926176E7 memory}, id = 781
> 01-03                  Project(zz1=[$0], zz2=[$2], a11=[$1]) : rowType = RecordType(ANY zz1, ANY zz2, ANY a11): rowcount = 2.3306983E7, cumulative cost = {5.4538844E7 rows, 3.71205404E8 cpu, 0.0 io, 3.2460300288E10 network, 4.64926176E7 memory}, id = 780
> 01-04                    HashJoin(condition=[=($0, $2)], joinType=[left]) : rowType = RecordType(ANY ZZ1, ANY A, ANY ZZ2): rowcount = 2.3306983E7, cumulative cost = {5.4538844E7 rows, 3.71205404E8 cpu, 0.0 io, 3.2460300288E10 network, 4.64926176E7 memory}, id = 779
> 01-06                      Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/csvd1, numFiles=3, columns=[`ZZ1`, `A`], files=[maprfs:/tmp/csvd1/D1111aamulti11random2.csv, maprfs:/tmp/csvd1/D1111aamulti11random21.csv, maprfs:/tmp/csvd1/D1111aamulti11random211.csv]]]) : rowType = RecordType(ANY ZZ1, ANY A): rowcount = 2.3306983E7, cumulative cost = {2.3306983E7 rows, 4.6613966E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 776
> 01-05                      BroadcastExchange : rowType = RecordType(ANY ZZ2): rowcount = 2641626.0, cumulative cost = {5283252.0 rows, 2.3774634E7 cpu, 0.0 io, 3.2460300288E10 network, 0.0 memory}, id = 778
> 02-01                        Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/csvd2, numFiles=1, columns=[`ZZ2`], files=[maprfs:/tmp/csvd2/D222random2.csv]]]) : rowType = RecordType(ANY ZZ2): rowcount = 2641626.0, cumulative cost = {2641626.0 rows, 2641626.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 777
> {code}
> The limit pushed down is Limit(fetch=[10100000]); instead it should be 
> Limit(fetch=[100000]).
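>
> A minimal sketch of the proposed behavior (this is the suggestion under 
> discussion, not what Drill currently does), assuming the query has no 
> ORDER BY:
> {code}
> -- Today the planner pushes offset + fetch below the UnionExchange:
> --   10000000 + 100000 = 10100000  ->  Limit(fetch=[10100000])
> -- so that the top-level Limit(offset=[10000000], fetch=[100000]) can
> -- still skip 10,000,000 rows after the exchange.
> --
> -- Because no ORDER BY means the output order is undefined, any 100000
> -- rows form a valid result, so the query is effectively equivalent to:
> select zz1,zz2,a11 from dfs.tmp.viewtmp limit 100000
> -- which would let the planner push down Limit(fetch=[100000]) and exit
> -- early.
> {code}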



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
