[
https://issues.apache.org/jira/browse/DRILL-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam Gilmore updated DRILL-2287:
--------------------------------
Attachment: pruning.json
no-pruning.json
> Filesystem partitioning is slow
> -------------------------------
>
> Key: DRILL-2287
> URL: https://issues.apache.org/jira/browse/DRILL-2287
> Project: Apache Drill
> Issue Type: Improvement
> Components: Query Planning & Optimization
> Affects Versions: 0.7.0, 0.8.0
> Reporter: Adam Gilmore
> Assignee: Jinfeng Ni
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: no-pruning.json, pruning.json
>
>
> We have created a number of Parquet files in different directories (e.g. 1,
> 2, 3, 4) to partition our data on the filesystem.
> Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query
> like:
> {code:sql}
> select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
> {code}
> The query is significantly slower than:
> {code:sql}
> select sum(price) from dfs.tmp.mydata
> {code}
> Looking at the physical plans, it looks like even if dir0 is only in the
> WHERE clause, it'll emit that from the scan, which then needs an extra step
> (a projection) to only project through the count (removing dir0). This
> appears to be the cause of the slowdown.
> To make it even more confusing, if you only select the LAST directory (i.e.
> in the case, 4), then it has a different physical plan again and seems to use
> a union-exchange.
> Ultimately, the query planner should realise that dir0 is not projected and
> then once the pushdown filesystem filtering is done, remove dir0 from being
> emitted from the scan and not require a project.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)