[jira] [Commented] (DRILL-2743) Parquet file metadata caching

Rahul Challapalli (JIRA) Wed, 19 Aug 2015 16:48:01 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703986#comment-14703986
 ]


Rahul Challapalli commented on DRILL-2743:
------------------------------------------

Steven,
 
1. How do we verify that we are not reading the footers during execution? Apart 
from planning taking less time, do we log anything to indicate that the 
metadata cache was used? This could be important for a support engineer trying 
to debug a customer query.
2. Does having both files and sub-directories under a directory change anything?
3. I want to add a test case that validates the contents of the generated 
cache file, so that a test fails if someone changes what is written to it. Do 
you expect any further changes to the cache file format for this Jira?
4. If one user creates this metadata file for a folder and a different user 
executes a query on the same folder, does the planner use the metadata file? 
Currently I see "-rwxr-xr-x 3 root root" on the metadata file. Does that mean 
users who are not in root's group cannot read the file?

- Rahul
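
For question 4, the mode string quoted above can be decoded mechanically. The following is a minimal sketch, not anything from the patch: the function names are hypothetical, and it only interprets the ls-style string, ignoring special bits such as setuid or sticky. Whether the planner actually uses the file for another user also depends on how Drill accesses the filesystem, which this does not address.

```python
import stat


def mode_string_to_octal(mode_string: str) -> int:
    """Convert an ls-style mode string such as "-rwxr-xr-x" to permission bits.

    Hypothetical helper for illustration; special bits (s, t) are not handled.
    """
    if len(mode_string) != 10:
        raise ValueError("expected 1 type character plus 9 permission characters")
    bits = 0
    # Skip the leading file-type character, then read user, group, other triplets.
    for ch in mode_string[1:]:
        bits = (bits << 1) | (ch != "-")
    return bits


def others_can_read(mode_string: str) -> bool:
    """Return True if the 'other' class has the read bit set."""
    return bool(mode_string_to_octal(mode_string) & stat.S_IROTH)


if __name__ == "__main__":
    mode = "-rwxr-xr-x"
    print(oct(mode_string_to_octal(mode)), others_can_read(mode))
```

Under these assumptions the string corresponds to mode 755, where the read bit for "other" is set on the file itself; access to the parent directories would still need to be checked separately.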

> Parquet file metadata caching
> -----------------------------
>
>                 Key: DRILL-2743
>                 URL: https://issues.apache.org/jira/browse/DRILL-2743
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - Parquet
>            Reporter: Steven Phillips
>            Assignee: Aman Sinha
>             Fix For: 1.2.0
>
>         Attachments: DRILL-2743.patch, drill.parquet_metadata
>
>
> To run a query against parquet files, we first have to recursively search the 
> directory tree for all of the files, get the block locations for each file, 
> and read the footer from each file. All of this is done during the planning 
> phase. When there are many files, this can result in a very large delay in 
> running the query, and it does not scale.
> However, there isn't really any need to read the footers during planning. If 
> we instead treat each parquet file as a single work unit, all we need to know 
> are the block locations for the file, the number of rows, and the columns. We 
> should store only the information we need for planning in a file 
> located in the top directory for a given parquet table, and then we can delay 
> reading the footers until execution time, which can be done in parallel.
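
The idea in the description can be sketched roughly as follows. This is an illustration under stated assumptions, not the format or code in DRILL-2743.patch: the cache file name, the JSON layout, the `.parquet` suffix filter, and the `read_summary` callback are all hypothetical. Block locations are omitted because they require a filesystem-specific API. The walk covers a directory that contains both files and sub-directories.

```python
import json
import os
from typing import Callable, Dict, List

# Hypothetical cache file name, chosen for this sketch only.
CACHE_FILE_NAME = "parquet_metadata_cache.json"


def build_metadata_cache(table_root: str,
                         read_summary: Callable[[str], Dict]) -> List[Dict]:
    """Walk table_root once and write a planning-only summary to its top directory.

    read_summary is a caller-supplied (hypothetical) function that returns
    {"rowCount": int, "columns": [str, ...]} for one parquet file.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(table_root):
        for name in filenames:
            if not name.endswith(".parquet"):
                continue  # skip the cache file itself and unrelated files
            path = os.path.join(dirpath, name)
            summary = read_summary(path)
            entries.append({
                "path": os.path.relpath(path, table_root),
                "rowCount": summary["rowCount"],
                "columns": summary["columns"],
            })
    # Sort so the file contents are deterministic and easy to compare in tests.
    entries.sort(key=lambda e: e["path"])
    with open(os.path.join(table_root, CACHE_FILE_NAME), "w") as f:
        json.dump({"files": entries}, f, indent=2)
    return entries


def load_metadata_cache(table_root: str) -> List[Dict]:
    """Read the summary back without touching any parquet footer."""
    with open(os.path.join(table_root, CACHE_FILE_NAME)) as f:
        return json.load(f)["files"]
```

With a layout like this, a planner would call `load_metadata_cache` once per table instead of opening every file, and a regression test could compare the loaded entries against an expected list to catch format changes.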



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
