[ 
https://issues.apache.org/jira/browse/DRILL-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004989#comment-15004989
 ] 

ASF GitHub Bot commented on DRILL-4053:
---------------------------------------

Github user parthchandra commented on the pull request:

    https://github.com/apache/drill/pull/254#issuecomment-156601055
  
    I used an alternate cache file because the code is simplistic and uses 
Jackson deserialization, which essentially means that one needs to know the 
version of the file before reading it. Right now I also ignore the older 
metadata cache file and remove it if refresh metadata is called.
    I could maintain multiple versions and dispatch on the version, though I 
would still prefer to keep the version info in the file name. 
    Let me know if you think that is necessary.
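
A minimal sketch of what "version info in the file name" plus "dispatch on version" could look like. The base name `parquet_metadata_cache`, the `.vN` suffix convention, and the reader labels are all hypothetical and are not taken from the pull request; the point is only that the version is known before any deserialization starts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names and the ".vN" suffix are assumptions, not Drill's actual scheme.
final class VersionedCacheFileSketch {
    // Hypothetical cache file base name; a file without a suffix is treated as version 1.
    static final String BASE_NAME = "parquet_metadata_cache";

    private static final Pattern VERSIONED =
        Pattern.compile(Pattern.quote(BASE_NAME) + "(?:\\.v(\\d+))?");

    /** Extracts the format version from the file name alone, without opening the file. */
    static int versionOf(String fileName) {
        Matcher m = VERSIONED.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a metadata cache file: " + fileName);
        }
        return m.group(1) == null ? 1 : Integer.parseInt(m.group(1));
    }

    /** Dispatches on the version; the returned label stands in for a version-specific reader. */
    static String readerFor(String fileName) {
        int version = versionOf(fileName);
        switch (version) {
            case 1:
                return "v1-reader";
            case 2:
                return "v2-reader";
            default:
                throw new IllegalStateException("Unsupported cache version: " + version);
        }
    }

    public static void main(String[] args) {
        System.out.println(readerFor(BASE_NAME));
        System.out.println(readerFor(BASE_NAME + ".v2"));
    }
}
```

In a real reader, each case would hand the file to a deserializer bound to that version's classes, which is what makes knowing the version up front useful with Jackson.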


> Reduce metadata cache file size
> -------------------------------
>
>                 Key: DRILL-4053
>                 URL: https://issues.apache.org/jira/browse/DRILL-4053
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.4.0
>
>
> The Parquet metadata cache file has a fair amount of redundant metadata that 
> causes the size of the cache file to bloat. Two things that we can reduce are:
> 1) Schema is repeated for every row group. We can keep a merged schema 
> (similar to what was discussed for insert into functionality).
> 2) The max and min values in the stats are used for partition pruning when 
> the values are the same. We can keep the maxValue only, and only if it is 
> the same as the minValue.
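
The two reductions described in the issue can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: schemas are modeled as simple column-name-to-type maps, the class and method names are hypothetical, and rejecting conflicting column types is a choice made here for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a simplified model of the two size reductions, not Drill's metadata classes.
final class CompactMetadataSketch {

    /** 1) Merge the per-row-group schemas (column name -> type) into a single schema. */
    static Map<String, String> mergeSchemas(List<Map<String, String>> rowGroupSchemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> schema : rowGroupSchemas) {
            for (Map.Entry<String, String> col : schema.entrySet()) {
                String previous = merged.putIfAbsent(col.getKey(), col.getValue());
                // Assumption: a column with two different types is treated as an error here.
                if (previous != null && !previous.equals(col.getValue())) {
                    throw new IllegalArgumentException("Conflicting types for " + col.getKey());
                }
            }
        }
        return merged;
    }

    /** 2) Keep only maxValue, and only when it equals minValue; otherwise store nothing. */
    static Object valueToStore(Object minValue, Object maxValue) {
        return (minValue != null && minValue.equals(maxValue)) ? maxValue : null;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "INT64");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "INT64");
        second.put("dir0", "BINARY");
        System.out.println(mergeSchemas(List.of(first, second)));
        System.out.println(valueToStore("2015", "2015"));
        System.out.println(valueToStore(1, 9));
    }
}
```

A column whose stored value is absent simply cannot be used for pruning on that row group, which matches the issue's observation that the stats only matter when min and max are equal.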



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
