[jira] [Commented] (DRILL-4308) Aggregate operations on dir columns can be more efficient for certain use cases

Aman Sinha (JIRA) Mon, 25 Jan 2016 20:49:56 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116679#comment-15116679
 ]


Aman Sinha commented on DRILL-4308:
-----------------------------------

[~jaltekruse] Right, I should have clarified that these types of queries may be 
generated by an external tool (e.g., Tableau), so we would need a rule-based 
rewrite to use mindir()/maxdir().  Actually, the second case (with DISTINCT) is 
the main reason I created the JIRA.  Using the show files output could be a 
reasonable approach, though I haven't looked into it much. 

Incidentally, I am getting a wrong result for the second query below.  I would 
expect it to produce the same result as the first query.  (My directory 
structure is year/quarter.)  Instead, the second query produces 'Q1' for dir0, 
which is incorrect.  Any thoughts?  If you think this is an issue, I can file a 
separate JIRA.  
{noformat}
0: jdbc:drill:zk=local> select dir0 from dfs.tmp.testdata order by dir0 limit 1;
+-------+
| dir0  |
+-------+
| 1994  |
+-------+
1 row selected (0.842 seconds)
0: jdbc:drill:zk=local> select dir0 from dfs.tmp.testdata where dir0=mindir('dfs.tmp', 'testdata') limit 1;
+-------+
| dir0  |
+-------+
| Q1    |
+-------+
1 row selected (0.311 seconds)
{noformat}

> Aggregate operations on dir<N> columns can be more efficient for certain use 
> cases
> ----------------------------------------------------------------------------------
>
>                 Key: DRILL-4308
>                 URL: https://issues.apache.org/jira/browse/DRILL-4308
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.4.0
>            Reporter: Aman Sinha
>
> For queries that perform plain aggregates or DISTINCT operations on the 
> directory partition columns (dir0, dir1, etc.) and reference no other 
> columns, performance could be substantially improved by not having to scan 
> the entire dataset.   
> Consider the following types of queries:
> {noformat}
> select  min(dir0) from largetable;
> select  distinct dir0 from largetable;
> {noformat}
> The number of distinct values of dir<N> columns is typically quite small and 
> there's no reason to scan the large table.  This has also come up as user 
> feedback from some Drill users.  Of course, if there's any other column 
> referenced in the query (WHERE, ORDER-BY etc.) then we cannot apply this 
> optimization.  
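
The idea in the description can be illustrated with a minimal sketch: the values of dir0 and dir1 can be read from the directory listing alone, without opening a single data file. This is not Drill's implementation and not a proposed patch. The helper names (can_use_listing, dir_values) and the year/quarter layout with years 1994 and 1995 are hypothetical, chosen only to mirror the example in the comment above. The sketch also treats every subdirectory as a partition value, even one that contains no data files, which a real rewrite rule would need to handle.

```python
import os
import re
import tempfile

# Matches directory partition column names such as dir0, dir1, ...
DIR_COLUMN = re.compile(r"^dir(\d+)$")


def can_use_listing(referenced_columns):
    """Return True only if the query references dir<N> columns and nothing else."""
    return bool(referenced_columns) and all(
        DIR_COLUMN.match(col) for col in referenced_columns
    )


def subdirs(path):
    """Full paths of the immediate subdirectories of path."""
    return [
        os.path.join(path, name)
        for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    ]


def dir_values(table_root, level):
    """Distinct directory names at the given depth (0 corresponds to dir0).

    Only directory metadata is read; no data file is opened.
    """
    current = [table_root]
    for _ in range(level):
        current = [child for parent in current for child in subdirs(parent)]
    names = {os.path.basename(child) for parent in current for child in subdirs(parent)}
    return sorted(names)


# Hypothetical year/quarter layout, built in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for year in ("1994", "1995"):
        for quarter in ("Q1", "Q2"):
            os.makedirs(os.path.join(root, year, quarter))
    dir0 = dir_values(root, 0)
    dir1 = dir_values(root, 1)

print("min(dir0):", min(dir0))
print("distinct dir0:", dir0)
print("distinct dir1:", dir1)
```

With this layout, the smallest dir0 value comes from the top-level (year) directories, which is what the first query in the comment returns. The can_use_listing check reflects the restriction stated in the description: as soon as any non-dir column is referenced, the listing alone is not enough.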



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
