[ https://issues.apache.org/jira/browse/DRILL-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375443#comment-14375443 ]
Adam Gilmore commented on DRILL-2287:
-------------------------------------
The sample data is 3 million rows in Parquet format in 3 separate directories.
Here are the two queries I'm running.
{code}
0: jdbc:drill:zk=local> select sum(price), count(*) from dfs.tmp.purchases;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 1.4988500549999561E7 | 3000000 |
+------------+------------+
1 row selected (0.237 seconds)
0: jdbc:drill:zk=local> select sum(price), count(*) from dfs.tmp.purchases
where dir0 in (1, 2, 3);
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 1.4988500549999561E7 | 3000000 |
+------------+------------+
1 row selected (0.539 seconds)
{code}
The second query is more than twice as slow (0.539 seconds versus 0.237
seconds), even though both return identical results. I've run each query
multiple times to confirm that these timings are representative.
I've attached both plans. The main difference is the extra filter/project in
the pruning plan.
As discussed, the best option would be to intelligently remove the
filter/project in cases where the filter is relatively simple.
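For anyone who wants to regenerate the two plans rather than read the attachments, a minimal sketch follows. It assumes the same dfs.tmp.purchases table and directory layout as above, and uses Drill's EXPLAIN PLAN FOR statement. If the behaviour is as described, the second plan should show the additional filter/project steps above the scan.
{code:sql}
-- Plan for the unfiltered aggregate (the fast case).
explain plan for
select sum(price), count(*) from dfs.tmp.purchases;

-- Plan for the same aggregate with a filter on the directory column dir0
-- (the slow case); compare the operators above the scan with the plan above.
explain plan for
select sum(price), count(*) from dfs.tmp.purchases
where dir0 in (1, 2, 3);
{code}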
> Filesystem partitioning is slow
> -------------------------------
>
> Key: DRILL-2287
> URL: https://issues.apache.org/jira/browse/DRILL-2287
> Project: Apache Drill
> Issue Type: Improvement
> Components: Query Planning & Optimization
> Affects Versions: 0.7.0, 0.8.0
> Reporter: Adam Gilmore
> Assignee: Jinfeng Ni
> Priority: Minor
> Fix For: 0.9.0
>
>
> We have created a number of Parquet files in different directories (e.g. 1,
> 2, 3, 4) to partition our data on the filesystem.
> Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query
> like:
> {code:sql}
> select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
> {code}
> The query is significantly slower than:
> {code:sql}
> select sum(price) from dfs.tmp.mydata
> {code}
> Looking at the physical plans, it looks like even if dir0 appears only in
> the WHERE clause, the scan still emits it, which then needs an extra step
> (a projection) to pass through only the column being aggregated (removing
> dir0). This appears to be the cause of the slowdown.
> To make it even more confusing, if you select only the LAST directory (i.e.
> in this case, 4), the physical plan is different again and seems to use
> a union-exchange.
> Ultimately, the query planner should realise that dir0 is not projected and,
> once the filesystem filter pushdown is done, stop emitting dir0 from the
> scan so that no project is required.