[jira] [Created] (DRILL-2287) Filesystem partitioning is slow

Adam Gilmore (JIRA) Sun, 22 Feb 2015 17:50:37 -0800

Adam Gilmore created DRILL-2287:
-----------------------------------

             Summary: Filesystem partitioning is slow
                 Key: DRILL-2287
                 URL: https://issues.apache.org/jira/browse/DRILL-2287
             Project: Apache Drill
          Issue Type: Improvement
          Components: Query Planning & Optimization
    Affects Versions: 0.7.0, 0.8.0
            Reporter: Adam Gilmore
            Assignee: Jinfeng Ni
            Priority: Minor



We have created a number of Parquet files in different directories (e.g. 1, 2, 
3, 4) to partition our data on the filesystem.

Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query 
like:

{code:sql}
select count(*) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
{code}

The query is significantly slower than:

{code:sql}
select count(*) from dfs.tmp.mydata
{code}

Looking at the physical plans, it looks like even if dir0 is only in the WHERE 
clause, it'll emit that from the scan, which then needs an extra step (a 
projection) to only project through the count (removing dir0).  This appears to 
be the cause of the slowdown.

To make it even more confusing, if you only select the LAST directory (i.e. in 
the case, 4), then it has a different physical plan again and seems to use a 
union-exchange.

Ultimately, the query planner should realise that dir0 is not projected and 
then once the pushdown filesystem filtering is done, remove dir0 from being 
emitted from the scan and not require a project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (DRILL-2287) Filesystem partitioning is slow

Reply via email to