[jira] [Commented] (DRILL-2287) Filesystem partitioning is slow

Adam Gilmore (JIRA) Sun, 22 Mar 2015 21:22:43 -0700

    [ https://issues.apache.org/jira/browse/DRILL-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375370#comment-14375370 ]


Adam Gilmore commented on DRILL-2287:
-------------------------------------

I'm using the latest code from master, so it's 0.8.  What's the best way for
me to provide you with some performance numbers?  Is there a standard set of
metrics I should send through?

I understand the purpose of partition pruning, and it makes sense that it may
not always be possible to remove the partition condition from the filter for
the reasons you mentioned.

While partition pruning is useful for reducing I/O reads, it makes the
queries more expensive because of the extra filter/project it adds to the
plan.  It'd be nice to have some logic that rewrites the plan to drop that
extra filter/project whenever it is safe to do so, even if that logic is
fairly rudimentary at first.

I'm not sure how difficult that latter part would be, but it would certainly 
get some nice gains in performance for simple queries.
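
A minimal sketch of what such a rudimentary rewrite could look like, written in
Python for brevity. Everything here is hypothetical: the Scan and Plan classes
and prune_and_simplify are made up for illustration and are not Drill's planner
API. The sketch assumes the only filter is a plain "dir0 in (...)" list; with
any other condition in the filter, the filter could not simply be dropped.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Scan:
    """Hypothetical scan node: directories to read and columns to emit."""
    directories: Tuple[str, ...]
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Plan:
    """Hypothetical, simplified plan: scan -> optional dir0 filter -> aggregate."""
    scan: Scan
    dir0_in: Optional[FrozenSet[str]]  # None means no partition filter is left
    output_columns: Tuple[str, ...]    # columns the aggregate actually reads


def prune_and_simplify(plan: Plan) -> Plan:
    """Prune directories, then drop the now-redundant filter and dir0 column."""
    if plan.dir0_in is None:
        return plan  # nothing to prune or simplify

    # Step 1: partition pruning, keep only directories named in the IN list.
    kept = tuple(d for d in plan.scan.directories if d in plan.dir0_in)

    # Step 2: every remaining directory satisfies the filter by construction,
    # so the filter is redundant (assumption: the filter is only on dir0).

    # Step 3: stop emitting dir0 from the scan if nothing downstream reads it,
    # which also removes the need for an extra project.
    columns = plan.scan.columns
    if "dir0" not in plan.output_columns:
        columns = tuple(c for c in columns if c != "dir0")

    return Plan(Scan(kept, columns), None, plan.output_columns)


# Usage: roughly "select sum(price) ... where dir0 in (1, 2)".
original = Plan(
    scan=Scan(directories=("1", "2", "3", "4"), columns=("price", "dir0")),
    dir0_in=frozenset({"1", "2"}),
    output_columns=("price",),
)
simplified = prune_and_simplify(original)
```

After the rewrite the scan reads only the matching directories and emits only
price, so the plan has the same shape as the query without the dir0 condition.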

> Filesystem partitioning is slow
> -------------------------------
>
>                 Key: DRILL-2287
>                 URL: https://issues.apache.org/jira/browse/DRILL-2287
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> We have created a number of Parquet files in different directories (e.g. 1, 
> 2, 3, 4) to partition our data on the filesystem.
> Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query 
> like:
> {code:sql}
> select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
> {code}
> The query is significantly slower than:
> {code:sql}
> select sum(price) from dfs.tmp.mydata
> {code}
> Looking at the physical plans, it looks like even if dir0 is only in the 
> WHERE clause, it'll emit that from the scan, which then needs an extra step 
> (a projection) to only project through the sum (removing dir0).  This 
> appears to be the cause of the slowdown.
> To make it even more confusing, if you only select the LAST directory (i.e. 
> in this case, 4), then it has a different physical plan again and seems to 
> use a union-exchange.
> Ultimately, the query planner should realise that dir0 is not projected and 
> then once the pushdown filesystem filtering is done, remove dir0 from being 
> emitted from the scan and not require a project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
