[ 
https://issues.apache.org/jira/browse/DRILL-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105993#comment-15105993
 ] 

Jinfeng Ni commented on DRILL-2517:
-----------------------------------

Pull request: https://github.com/apache/drill/pull/328/files 

The PR contains the changes from both Adam and Mehant. I added some code changes 
on top of theirs.

I did a preliminary performance comparison on my Mac laptop. The test table has 
about 115k parquet files in total, organized in 25 directories (1990, 1991, ...), 
each with four subdirectories (Q1, Q2, Q3, Q4). 

For the following query:
{code}
explain plan for select * from t1 where dir0= 1990 and dir1 = 'Q1';
{code}

The master branch takes 19.4 seconds; the DRILL-2517 patch takes 8.8 seconds. 
Both were measured on the second run, with a warm cache. 
{code}
1 row selected (19.434 seconds)

1 row selected (8.845 seconds)
{code} 

The log shows that the time for reading parquet metadata from the file footers is 
significantly reduced (from 7388ms to 102ms), due to the pruning effect. 

On master branch: 
{code}
Fetch parquet metadata: Executed 115544 out of 115544 using 16 threads. Time: 
7388ms total, 1.019393ms avg, 745ms max.
{code}

With patch:
{code}
Fetch parquet metadata: Executed 1111 out of 1111 using 16 threads. Time: 102ms 
total, 1.053320ms avg, 8ms max.
{code}
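
To illustrate why the footer count drops so sharply, here is a minimal Python sketch of the idea (not Drill's actual code): filter the file list on the directory components first, and only read footers for the files that survive. The names `partition_dirs` and `prune`, the root path `/data/t1`, and the two-files-per-quarter layout are all hypothetical; directory values are compared as strings for simplicity.

```python
from pathlib import PurePosixPath


def partition_dirs(path, root):
    """Return the directory components between root and the file (dir0, dir1, ...)."""
    rel = PurePosixPath(path).relative_to(root)
    return rel.parts[:-1]  # drop the file name itself


def prune(paths, root, wanted):
    """Keep only files whose dirN values match `wanted`, a {level: value} dict.

    Only path strings are inspected here; no file is opened.
    """
    kept = []
    for p in paths:
        dirs = partition_dirs(p, root)
        if all(level < len(dirs) and dirs[level] == value
               for level, value in wanted.items()):
            kept.append(p)
    return kept


# Hypothetical layout: 25 years x 4 quarters x 2 files per quarter.
root = "/data/t1"
files = [
    f"{root}/{year}/Q{q}/{i}.parquet"
    for year in range(1990, 2015)
    for q in range(1, 5)
    for i in range(2)
]

# Equivalent of: where dir0 = 1990 and dir1 = 'Q1'
selected = prune(files, root, {0: "1990", 1: "Q1"})

# Footers would then be read only for `selected`, not for all of `files`.
print(len(files), len(selected))
```

With 25 x 4 = 100 leaf directories, a filter on both dir0 and dir1 keeps roughly 1/100 of the files if they are evenly spread, which is in line with 1111 out of 115544 in the logs above.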


> Apply Partition pruning before reading files during planning
> ------------------------------------------------------------
>
>                 Key: DRILL-2517
>                 URL: https://issues.apache.org/jira/browse/DRILL-2517
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>             Fix For: Future
>
>
> Partition pruning still tries to read Parquet files during the planning stage 
> even though they don't match the partition filter.
> For example, if there were an invalid Parquet file in a directory that should 
> not be queried:
> {code}
> 0: jdbc:drill:zk=local> select sum(price) from dfs.tmp.purchases where dir0 = 
> 1;
> Query failed: IllegalArgumentException: file:/tmp/purchases/4/0_0_0.parquet 
> is not a Parquet file (too small)
> {code}
> The reason is that the partition pruning happens after the Parquet plugin 
> tries to read the footer of each file.
> Ideally, partition pruning would happen first before the format plugin gets 
> involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
