[jira] [Comment Edited] (DRILL-4308) Aggregate operations on dir columns can be more efficient for certain use cases

Jason Altekruse (JIRA) Fri, 29 Jan 2016 14:28:53 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124340#comment-15124340
 ]


Jason Altekruse edited comment on DRILL-4308 at 1/29/16 10:27 PM:
------------------------------------------------------------------

Hey [~amansinha100], I tried re-creating this and was not able to see this 
behavior. I only created the folder structure on my local machine (shown 
below), but I seem to be getting correct results for these types of 
queries.

{code}
0: jdbc:drill:zk=local> select dir0 from mock_data where dir0 = mindir('dfs.mxd','mock_data') limit 1;
+-------+
| dir0  |
+-------+
| 1994  |
+-------+
1 row selected (0.127 seconds)
0: jdbc:drill:zk=local> select dir0 from mock_data where dir0 = maxdir('dfs.mxd','mock_data') limit 1;
+-------+
| dir0  |
+-------+
| 1997  |
+-------+
1 row selected (0.123 seconds)



Jasons-MacBook-Pro:maxdir jaltekruse$ tree mock_data/
mock_data/
├── 1994
│   ├── Q1
│   │   └── data.csv
│   ├── Q2
│   │   └── data.csv
│   ├── Q3
│   │   └── data.csv
│   └── Q4
│       └── data.csv
├── 1995
│   ├── Q1
│   │   └── data.csv
│   ├── Q2
│   │   └── data.csv
│   ├── Q3
│   │   └── data.csv
│   └── Q4
│       └── data.csv
├── 1996
│   ├── Q1
│   │   └── data.csv
│   ├── Q2
│   │   └── data.csv
│   ├── Q3
│   │   └── data.csv
│   └── Q4
│       └── data.csv
└── 1997
    ├── Q1
    │   └── data.csv
    ├── Q2
    │   └── data.csv
    ├── Q3
    │   └── data.csv
    └── Q4
        └── data.csv
{code}
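
For anyone wanting to reproduce the folder layout shown in the tree output, a minimal Python sketch follows. The helper name `make_mock_data` and the placeholder file contents are assumptions for illustration only; the comment does not show what the data.csv files contained or how the folders were created.

```python
import tempfile
from pathlib import Path


def make_mock_data(root: Path) -> Path:
    """Create mock_data/<year>/<quarter>/data.csv under root (hypothetical helper)."""
    base = root / "mock_data"
    for year in range(1994, 1998):
        for quarter in ("Q1", "Q2", "Q3", "Q4"):
            part = base / str(year) / quarter
            part.mkdir(parents=True, exist_ok=True)
            # Placeholder content: the original data.csv files were not shown.
            (part / "data.csv").write_text("placeholder\n")
    return base


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = make_mock_data(Path(tmp))
        # List the year-level directories that Drill would expose as dir0.
        print(sorted(p.name for p in base.iterdir()))
```

To query the result the way the session above does, the parent directory would presumably need to be configured as a workspace on the dfs storage plugin (the session uses one named `mxd`).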


> Aggregate operations on dir<N> columns can be more efficient for certain use 
> cases
> ----------------------------------------------------------------------------------
>
>                 Key: DRILL-4308
>                 URL: https://issues.apache.org/jira/browse/DRILL-4308
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.4.0
>            Reporter: Aman Sinha
>
> For queries that perform plain aggregates or DISTINCT operations on the 
> directory partition columns (dir0, dir1, etc.) and reference no other 
> columns, performance could be substantially improved by not scanning the 
> entire dataset.
> Consider the following types of queries:
> {noformat}
> select  min(dir0) from largetable;
> select  distinct dir0 from largetable;
> {noformat}
> The number of distinct values of dir<N> columns is typically quite small, and 
> there's no reason to scan the large table.  This has also come up as 
> feedback from some Drill users.  Of course, if any other column is 
> referenced in the query (WHERE, ORDER BY, etc.), then we cannot apply this 
> optimization.
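
The idea in the description can be illustrated outside Drill with a short Python sketch: the values of dir0 are just the names of the first-level subdirectories, so `distinct dir0` and `min(dir0)` can be read from a directory listing without opening any data file. This is only an illustration of the principle, not Drill's implementation or a proposed patch. The function names `distinct_dir0` and `min_dir0` are hypothetical, and the sketch assumes a local filesystem and that every subdirectory holds at least one row (an empty directory would otherwise show up as a value that a real scan would not return).

```python
import os
import tempfile


def distinct_dir0(table_root):
    """Sorted names of first-level subdirectories, i.e. the candidate dir0 values."""
    with os.scandir(table_root) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


def min_dir0(table_root):
    """Smallest dir0 value, compared as a string; None if there are no subdirectories."""
    values = distinct_dir0(table_root)
    return values[0] if values else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Build a small year/quarter layout; no data files are needed for the listing.
        for year in ("1994", "1995", "1996", "1997"):
            os.makedirs(os.path.join(root, year, "Q1"))
        print(distinct_dir0(root))
        print(min_dir0(root))
```

The comparison is done on strings here because directory names are text; whether that matches the ordering a given query expects depends on how the directories are named.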



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
