[ 
https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961165#comment-16961165
 ] 

Xinli Shang edited comment on PARQUET-1685 at 10/28/19 3:35 PM:
----------------------------------------------------------------

Hi [~gszadovszky], thanks for your reply!

Regarding "an implementation might rely on the fact that the min/max values are 
actual values": was this already discussed when statistics truncation was 
implemented for the 'column index'? I would like to add [~rdblue], who may 
already have discussed and thought about this, because truncation is 
implemented in Iceberg.

For the 4k hard limit, I am looking at it from the other direction. If empty 
statistics are written because the statistics are oversized, queries become 
inefficient. If truncation reduces the statistics size and, as a result, 
reduces the number of files written with empty statistics, then it is a big win.

In 1.11.0+, is it enforced to use the 'column index' and not to write page 
statistics?
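
To make the "actual values" concern concrete, here is a minimal sketch of the 
kind of truncation being discussed. It is not the Iceberg UnicodeUtil code and 
not a parquet-mr API; the class and method names (TruncateStatsSketch, 
truncateMin, truncateMax) are hypothetical. It assumes bounds are compared in 
Unicode code point order, which matches unsigned byte order of UTF-8. The min 
is cut to a prefix, and the max is cut to a prefix whose last code point is 
incremented so that it still sorts at or after the original value.

```java
import java.util.Optional;

// Hypothetical sketch: names are illustrative, not taken from Iceberg or parquet-mr.
class TruncateStatsSketch {

    // Lower bound: keep the first `length` code points.
    // A prefix never sorts after the string it was cut from.
    static String truncateMin(String min, int length) {
        if (min.codePointCount(0, min.length()) <= length) {
            return min; // already short enough, keep the actual value
        }
        return min.substring(0, min.offsetByCodePoints(0, length));
    }

    // Upper bound: keep the first `length` code points, then increment the last
    // code point that can be incremented, dropping any that cannot.
    // Returns empty if no valid upper bound of that length exists.
    static Optional<String> truncateMax(String max, int length) {
        if (max.codePointCount(0, max.length()) <= length) {
            return Optional.of(max); // already short enough, keep the actual value
        }
        StringBuilder prefix =
            new StringBuilder(max.substring(0, max.offsetByCodePoints(0, length)));
        while (prefix.length() > 0) {
            int last = prefix.codePointBefore(prefix.length());
            prefix.setLength(prefix.length() - Character.charCount(last));
            if (last < Character.MAX_CODE_POINT) {
                int next = last + 1;
                if (next == Character.MIN_SURROGATE) {
                    // Skip the surrogate range, which holds no valid code points.
                    next = Character.MAX_SURROGATE + 1;
                }
                return Optional.of(prefix.appendCodePoint(next).toString());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(truncateMin("parquet-statistics", 4)); // parq
        System.out.println(truncateMax("parquet-statistics", 4).orElse("<no bound>")); // parr
    }
}
```

Note that "parr" is a valid upper bound for "parquet-statistics" but is not a 
value that occurs in the column, which is exactly the case a reader relying on 
min/max being actual values would need to handle.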


> Truncate the stored min and max for String statistics to reduce the footer 
> size 
> --------------------------------------------------------------------------------
>
>                 Key: PARQUET-1685
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1685
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min and max statistics 
> to minimize the metadata size. We can borrow this approach and truncate them 
> in Parquet as well, to reduce the size of the footer, or even the page 
> header. Here is the code in Iceberg: 
> [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
