[jira] [Comment Edited] (DRILL-7038) Queries on partitioned columns scan the entire datasets

Bridget Bevens (JIRA) Mon, 01 Apr 2019 14:50:07 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805567#comment-16805567
 ]


Bridget Bevens edited comment on DRILL-7038 at 4/1/19 9:49 PM:
---------------------------------------------------------------

Thanks, [~KazydubB]!

Is this okay to add to the document?

Starting in Drill 1.16, Drill uses a Value operator instead of a Scan operator 
to read data when a query references a directory column only and has a DISTINCT 
or GROUP BY operation. Instead of scanning all directory columns, Drill either 
reads the specified column from the metadata cache file (if one exists) or 
Drill selects directly from the directory (partition location). The presence of 
the Values operator (instead of the Scan operator) in the query plan indicates 
that Drill is using this optimization, as shown in the following examples:  

select distinct dir0 from `/logs`;
+------+
| dir0 |
+------+
| 2015 |
| 2016 |
| 2017 |
+------+


explain plan for select distinct dir0 from `/logs`;
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|                                       text                                    
   |                                       json                                 
      |
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| 00-00    Screen
00-01      Project(dir0=[$0])
00-02        StreamAgg(group=[{0}])
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' 
}, { '2017' }, { '2015' }]])



select dir0 from `/logs` group by dir0;
+------+
| dir0 |
+------+
| 2015 |
| 2016 |
| 2017 |
+------+

explain plan for select dir0 from `/logs` group by dir0;
>
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|                                       text                                    
   |                                       json                                 
      |
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| 00-00    Screen
00-01      Project(dir0=[$0])
00-02        StreamAgg(group=[{0}])
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' 
}, { '2017' }, { '2015' }]])

Thanks,
Bridget


was (Author: bbevens):
Thanks, [~KazydubB]!

Is this okay to add to the document?

Starting in 1.16, Drill uses a Values operator instead of the Scan operator for 
DISTINCT and GROUP BY queries on tables and directories, as shown in the 
following examples:

select distinct dir0 from `/logs`;
+------+
| dir0 |
+------+
| 2015 |
| 2016 |
| 2017 |
+------+


explain plan for select distinct dir0 from `/logs`;
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|                                       text                                    
   |                                       json                                 
      |
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| 00-00    Screen
00-01      Project(dir0=[$0])
00-02        StreamAgg(group=[{0}])
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' 
}, { '2017' }, { '2015' }]])



select dir0 from `/logs` group by dir0;
+------+
| dir0 |
+------+
| 2015 |
| 2016 |
| 2017 |
+------+

explain plan for select dir0 from `/logs` group by dir0;
>
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|                                       text                                    
   |                                       json                                 
      |
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| 00-00    Screen
00-01      Project(dir0=[$0])
00-02        StreamAgg(group=[{0}])
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            Values(tuples=[[{ '2015' }, { '2015' }, { '2016' }, { '2015' 
}, { '2017' }, { '2015' }]])

Thanks,
Bridget

> Queries on partitioned columns scan the entire datasets
> -------------------------------------------------------
>
>                 Key: DRILL-7038
>                 URL: https://issues.apache.org/jira/browse/DRILL-7038
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Bohdan Kazydub
>            Assignee: Bohdan Kazydub
>            Priority: Major
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.16.0
>
>
> For tables with hive-style partitions like
> {code}
> /table/2018/Q1
> /table/2018/Q2
> /table/2019/Q1
> etc.
> {code}
> if any of the following queries is run:
> {code}
> select distinct dir0 from dfs.`/table`
> {code}
> {code}
> select dir0 from dfs.`/table` group by dir0
> {code}
> it will actually scan every single record in the table rather than just 
> getting a list of directories at the dir0 level. This applies even when 
> cached metadata is available. This is a big penalty especially as the 
> datasets grow.
> To avoid such situations, a logical prune rule can be used to collect 
> partition columns (`dir0`), either from metadata cache (if available) or 
> group scan, and drop unnecessary files from being read. The rule will be 
> applied on following conditions:
> 1) all queried columns are partitoin columns, and
> 2) either {{DISTINCT}} or {{GROUP BY}} operations are performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (DRILL-7038) Queries on partitioned columns scan the entire datasets

Reply via email to