[ 
https://issues.apache.org/jira/browse/DRILL-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Gilmore updated DRILL-2287:
--------------------------------
    Attachment: pruning.json
                no-pruning.json

> Filesystem partitioning is slow
> -------------------------------
>
>                 Key: DRILL-2287
>                 URL: https://issues.apache.org/jira/browse/DRILL-2287
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: no-pruning.json, pruning.json
>
>
> We have created a number of Parquet files in different directories (e.g. 1, 
> 2, 3, 4) to partition our data on the filesystem.
> Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query 
> like:
> {code:sql}
> select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
> {code}
> The query is significantly slower than:
> {code:sql}
> select sum(price) from dfs.tmp.mydata
> {code}
> Looking at the physical plans, it looks like even if dir0 appears only in the 
> WHERE clause, the scan still emits it, which then needs an extra step 
> (a projection) to pass through only the sum (removing dir0).  This 
> appears to be the cause of the slowdown.
> To make it even more confusing, if you select only the LAST directory (i.e. 
> in this case, 4), the physical plan is different again and seems to use 
> a union-exchange.
> Ultimately, the query planner should realise that dir0 is not projected and, 
> once the filesystem filter pushdown is done, stop emitting dir0 from the 
> scan so that no project is required.
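> For reference, a sketch of how the plans can be compared using Drill's 
> EXPLAIN PLAN FOR.  The dir0 = 4 predicate in the last statement is an 
> assumed form of the "last directory" query; the exact query is not given 
> above.
> {code:sql}
> -- plan with the dir0 filter (the slower query)
> explain plan for select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4);
> -- plan without the filter
> explain plan for select sum(price) from dfs.tmp.mydata;
> -- assumed form of the query that selects only the last directory
> explain plan for select sum(price) from dfs.tmp.mydata where dir0 = 4;
> {code}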



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
