[jira] [Commented] (DRILL-2287) Filesystem partitioning is slow

Aman Sinha (JIRA) Sun, 22 Mar 2015 22:12:28 -0700

    [ https://issues.apache.org/jira/browse/DRILL-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375407#comment-14375407 ]


Aman Sinha commented on DRILL-2287:
-----------------------------------

Nothing elaborate, just the elapsed times for the queries that you ran, the 
corresponding EXPLAIN plans, and the row counts of the tables.  You can attach 
these to the JIRA as one text file or several.  For performance tests, it 
is best to run each query twice and take the second run's elapsed time, so 
that all queries benefit equally from any caching at the file system level.

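For the plans and the row count, statements along these lines should be enough. This is only a sketch: the table and column names are the ones from the issue description, and the plain count(*) is an assumption about the simplest way to report table size.
{code:sql}
explain plan for select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4);
explain plan for select sum(price) from dfs.tmp.mydata;
select count(*) from dfs.tmp.mydata;
{code}
The first two statements return the plans for the filtered and unfiltered queries without executing them; the third gives the row count.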
I agree that removing the Filter node from the plan would be useful once it is 
established that all components of that filter have been pushed into the scan.  
However, we would need to do some performance evaluations to determine how 
large the improvement is and whether it justifies further work.  In fact, in 
version 0.7 we do remove the filter in this case, but I haven't done the 
performance comparison with a large enough data set. 


> Filesystem partitioning is slow
> -------------------------------
>
>                 Key: DRILL-2287
>                 URL: https://issues.apache.org/jira/browse/DRILL-2287
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> We have created a number of Parquet files in different directories (e.g. 1, 
> 2, 3, 4) to partition our data on the filesystem.
> Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query 
> like:
> {code:sql}
> select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
> {code}
> The query is significantly slower than:
> {code:sql}
> select sum(price) from dfs.tmp.mydata
> {code}
> Looking at the physical plans, it looks like even if dir0 appears only in 
> the WHERE clause, the scan still emits it, which then requires an extra step 
> (a projection) to pass through only the aggregate (removing dir0).  This 
> appears to be the cause of the slowdown.
> To make it even more confusing, if you select only the LAST directory (i.e. 
> in this case, 4), then it has a different physical plan again and seems to 
> use a union-exchange.
> Ultimately, the query planner should realise that dir0 is not projected and, 
> once the filesystem filter pushdown is done, stop emitting dir0 from the scan 
> so that no project is required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
