[ https://issues.apache.org/jira/browse/DRILL-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004900#comment-15004900 ]
ASF GitHub Bot commented on DRILL-4053:
---------------------------------------
Github user parthchandra commented on the pull request:
https://github.com/apache/drill/pull/254#issuecomment-156589078
The last perf run was on ~400K parquet files in a 2-level directory
hierarchy:

root_dir/
    2015-10-01/
        d_01/
        ...
        d_25/
    ...
    2015-10-31/
        d_01/
        ...
        d_25/

That gave us 31 x 25 = 775 directories with 511 parquet files in each
directory (396,025 files in total), which is where the ~400K figure
comes from. The numbers are with a warm file system cache.
Build               Explain Plan   Refresh metadata   Metadata size
MASTER (e275dc5)    145.854 s      225.083 s          2.9 GB
with DRILL-4053      35.430 s      202.133 s          1.3 GB
The explain plan query: "explain plan for select * from root_dir"
The refresh metadata query: "refresh table metadata dfs.root_dir"
> Reduce metadata cache file size
> -------------------------------
>
> Key: DRILL-4053
> URL: https://issues.apache.org/jira/browse/DRILL-4053
> Project: Apache Drill
> Issue Type: Improvement
> Components: Metadata
> Affects Versions: 1.3.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Fix For: 1.4.0
>
>
> The parquet metadata cache file contains a fair amount of redundant
> metadata that bloats the size of the cache file. Two things we can reduce
> are:
> 1) The schema is repeated for every row group. We can keep a single merged
> schema for the table (similar to what was discussed for the insert-into
> functionality).
> 2) The max and min values in the column statistics are used for partition
> pruning only when the two values are the same. We can therefore keep just
> the maxValue, and only when it equals the minValue.
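For illustration, a minimal sketch of the two reductions in Java. All
class and field names here (ParquetTableMetadata, mergedSchema, mxValue,
etc.) are hypothetical stand-ins, not Drill's actual metadata classes:

    import java.util.List;
    import java.util.Map;

    class ParquetTableMetadata {
      // (1) Schema kept once for the whole table instead of being
      // repeated in every row group entry.
      Map<String, String> mergedSchema;   // column name -> type
      List<RowGroupMetadata> rowGroups;
    }

    class RowGroupMetadata {
      List<ColumnMetadata> columns;       // no per-row-group schema
    }

    class ColumnMetadata {
      String name;
      // (2) Only the max value is stored, and only when max == min;
      // null means the row group is not single-valued for this column.
      Object mxValue;

      static ColumnMetadata fromStats(String name, Object min, Object max) {
        ColumnMetadata c = new ColumnMetadata();
        c.name = name;
        c.mxValue = (min != null && min.equals(max)) ? max : null;
        return c;
      }

      // Partition pruning only applies to columns that hold a single
      // value throughout a row group, so dropping min when it differs
      // from max loses nothing for pruning.
      boolean prunableWith(Object partitionValue) {
        return mxValue != null && mxValue.equals(partitionValue);
      }
    }

In this sketch the merged schema removes the per-row-group schema
repetition, and the single mxValue field replaces the redundant min/max
pair, which together account for the cache-size reduction described above.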