[ 
https://issues.apache.org/jira/browse/DRILL-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532400#comment-14532400
 ] 

Adam Gilmore commented on DRILL-2743:
-------------------------------------

This will also clash with my patch for DRILL-1950, which uses the footers to 
filter out row groups based on statistics. That filtering has to happen at the 
planning stage so the optimizer can assess whether the pushdown filter would 
actually be less costly and, if so, pick that plan.

It would be great to have the row group metadata cached (including the row 
group's statistics etc.).
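
To make the idea concrete, here is a minimal sketch of how cached row group 
statistics could be used to prune row groups at planning time. It is not taken 
from the DRILL-1950 patch or from Drill's code: RowGroupMeta, prune_row_groups 
and estimated_rows are hypothetical names, and the statistics are assumed to be 
simple integer min/max pairs per column.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroupMeta:
    """Hypothetical cached entry for one row group (not Drill's actual format)."""
    path: str
    row_count: int
    # Column name -> (min, max), as they would be copied from footer statistics.
    stats: Dict[str, Tuple[int, int]]


def prune_row_groups(groups: List[RowGroupMeta], column: str,
                     low: int, high: int) -> List[RowGroupMeta]:
    """Keep only row groups whose [min, max] range may overlap [low, high]."""
    kept = []
    for group in groups:
        if column not in group.stats:
            # Without statistics the row group cannot be ruled out.
            kept.append(group)
            continue
        col_min, col_max = group.stats[column]
        if col_max >= low and col_min <= high:
            kept.append(group)
    return kept


def estimated_rows(groups: List[RowGroupMeta]) -> int:
    """Row count a planner could use when costing the pushdown plan."""
    return sum(group.row_count for group in groups)
```

With the statistics in the cache, the planner could compare estimated_rows 
before and after pruning without opening any footer.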

> Parquet file metadata caching
> -----------------------------
>
>                 Key: DRILL-2743
>                 URL: https://issues.apache.org/jira/browse/DRILL-2743
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - Parquet
>            Reporter: Steven Phillips
>            Assignee: Steven Phillips
>             Fix For: 1.0.0
>
>         Attachments: DRILL-2743.patch, drill.parquet_metadata
>
>
> To run a query against parquet files, we have to first recursively search the 
> directory tree for all of the files, get the block locations for each file, 
> and read the footer from each file, and this is done during the planning 
> phase. When there are many files, this can result in a very large delay in 
> running the query, and it does not scale.
> However, there isn't really any need to read the footers during planning. If 
> we instead treat each parquet file as a single work unit, all we need to know 
> are the block locations for the file, the number of rows, and the columns. We 
> should store only the information which we need for planning in a file 
> located in the top directory for a given parquet table, and then we can delay 
> reading of the footers until execution time, which can be done in parallel.
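
As a rough illustration of the description above, the following sketch writes 
and reads a planning-only metadata file in a table's top directory. The file 
name, the JSON layout and the function names are assumptions made for this 
example; they do not describe the format used by the attached patch.

```python
import json
import os
from typing import Dict, List

CACHE_NAME = "planning_metadata.json"  # hypothetical file name


def write_cache(table_dir: str, files: List[Dict]) -> str:
    """Store only what planning needs: path, block hosts, row count, columns."""
    entries = [
        {
            "path": f["path"],
            "block_hosts": f["block_hosts"],
            "row_count": f["row_count"],
            "columns": f["columns"],
        }
        for f in files
    ]
    cache_path = os.path.join(table_dir, CACHE_NAME)
    with open(cache_path, "w", encoding="utf-8") as out:
        json.dump({"files": entries}, out)
    return cache_path


def read_cache(table_dir: str) -> List[Dict]:
    """Load all planning entries with a single read, without touching footers."""
    with open(os.path.join(table_dir, CACHE_NAME), encoding="utf-8") as src:
        return json.load(src)["files"]
```

Planning would then call read_cache once per table instead of walking the 
directory tree, and the footers would only be read at execution time.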



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
