[ https://issues.apache.org/jira/browse/DRILL-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004900#comment-15004900 ]
ASF GitHub Bot commented on DRILL-4053:
---------------------------------------
Github user parthchandra commented on the pull request:
https://github.com/apache/drill/pull/254#issuecomment-156589078
The last perf run was on ~400K parquet files in a 2-level directory
hierarchy:

root_dir/
    2015-10-01/
        d_01/
        ...
        d_25/
    ...
    2015-10-31/
        d_01/
        ...
        d_25/

That gave us 31 x 25 = 775 directories with 511 parquet files in each
directory (396,025 files in total), which is where the ~400K figure
comes from. The numbers are with a warm file system cache.
Build               Explain Plan   Refresh metadata   Metadata size
MASTER (e275dc5)    145.854 s      225.083 s          2.9 GB
with DRILL-4053      35.430 s      202.133 s          1.3 GB
The explain plan query: "explain plan for select * from root_dir"
The refresh metadata query: "refresh table metadata dfs.root_dir"
> Reduce metadata cache file size
> -------------------------------
>
> Key: DRILL-4053
> URL: https://issues.apache.org/jira/browse/DRILL-4053
> Project: Apache Drill
> Issue Type: Improvement
> Components: Metadata
> Affects Versions: 1.3.0
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Fix For: 1.4.0
>
>
> The parquet metadata cache file contains a fair amount of redundant
> metadata that bloats the size of the cache file. Two things we can reduce
> are:
> 1) The schema is repeated for every row group. We can keep a single merged
> schema for the table (similar to what was discussed for the insert-into
> functionality).
> 2) The max and min values in the column statistics are used for partition
> pruning only when the two values are the same. We can therefore keep just
> the maxValue, and only when it equals the minValue.
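For illustration, a minimal sketch of the two reductions in Java. All
class and field names here (ParquetTableMetadata, mergedSchema, mxValue,
etc.) are hypothetical stand-ins, not Drill's actual metadata classes:

    import java.util.List;
    import java.util.Map;

    class ParquetTableMetadata {
      // (1) Schema kept once for the whole table instead of being
      // repeated in every row group entry.
      Map<String, String> mergedSchema;   // column name -> type
      List<RowGroupMetadata> rowGroups;
    }

    class RowGroupMetadata {
      List<ColumnMetadata> columns;       // no per-row-group schema
    }

    class ColumnMetadata {
      String name;
      // (2) Only the max value is stored, and only when max == min;
      // null means the row group is not single-valued for this column.
      Object mxValue;

      static ColumnMetadata fromStats(String name, Object min, Object max) {
        ColumnMetadata c = new ColumnMetadata();
        c.name = name;
        c.mxValue = (min != null && min.equals(max)) ? max : null;
        return c;
      }

      // Partition pruning only applies to columns that hold a single
      // value throughout a row group, so dropping min when it differs
      // from max loses nothing for pruning.
      boolean prunableWith(Object partitionValue) {
        return mxValue != null && mxValue.equals(partitionValue);
      }
    }

In this sketch the merged schema removes the per-row-group schema
repetition, and the single mxValue field replaces the redundant min/max
pair, which together account for the cache-size reduction described above.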