[jira] [Updated] (PARQUET-372) Parquet stats can have awkwardly large values

2018-04-21 Thread Gabor Szadovszky (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-372:
-------------------------------------
    Fix Version/s: 1.8.2

> Parquet stats can have awkwardly large values
> ---------------------------------------------
>
>                 Key: PARQUET-372
>                 URL: https://issues.apache.org/jira/browse/PARQUET-372
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>            Priority: Major
>             Fix For: 1.9.0, 1.8.2
>
> If a column is storing very large values, say 2-4 MB, then the page header's 
> min and max values can also be this large.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-372) Parquet stats can have awkwardly large values

2015-09-25 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-372:
------------------------------
    Description: If a column is storing very large values, say 2-4 MB, then the 
page header's min and max values can also be this large.  (was: If a column is 
storing very large values, say 2-4 MB, then the page header's min and max 
values can also be this large. It is wasteful to keep that much data in a page 
header, so we should look at options for decreasing the size required in these 
cases.

One idea is to truncate binary min and max values and change the last byte to 
0xFF (max) or 0x00 (min), giving roughly equivalent bounds that are not huge. 
This probably has problems when the data stores multi-byte UTF-8 characters, 
so we have to be careful and look into how byte-wise sorting interacts with 
UTF-8.)
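
A minimal sketch of the truncation idea from the old description, written in Python for illustration only. The names `truncate_min`, `truncate_max`, and `limit` are hypothetical and not part of parquet-mr or parquet-format. It assumes values are compared byte-wise as unsigned bytes, which is how Python compares `bytes`. One deviation from the idea as stated: overwriting the last kept byte with 0xFF does not give an upper bound when that byte is already 0xFF, so this sketch keeps the full value in that case. The results are not guaranteed to be valid UTF-8, which is the concern raised above.

```python
def truncate_min(value: bytes, limit: int) -> bytes:
    """Return a lower bound for `value` that is at most `limit` bytes long."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if len(value) <= limit:
        return value  # already small enough, keep the exact value
    # Keep limit-1 bytes and end with 0x00; this sorts at or below `value`.
    return value[:limit - 1] + b"\x00"


def truncate_max(value: bytes, limit: int) -> bytes:
    """Return an upper bound for `value`, at most `limit` bytes when possible."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if len(value) <= limit:
        return value
    if value[limit - 1] == 0xFF:
        # Assumption of this sketch: if the last kept byte is already 0xFF,
        # the shortened value would be a prefix of `value` and sort below it,
        # so fall back to the untruncated value.
        return value
    # Keep limit-1 bytes and end with 0xFF; this sorts above `value`.
    return value[:limit - 1] + b"\xff"


# Usage: the shortened bounds still bracket the original value.
original = b"abcdefgh"
lower = truncate_min(original, 4)
upper = truncate_max(original, 4)
assert lower <= original <= upper
```

The bounds are looser than the true min and max, so a reader filtering on them may scan a page it could have skipped, but it will not skip a page that contains matching values.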

> Parquet stats can have awkwardly large values
> ---------------------------------------------
>
>                 Key: PARQUET-372
>                 URL: https://issues.apache.org/jira/browse/PARQUET-372
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>            Reporter: Ryan Blue
>
> If a column is storing very large values, say 2-4 MB, then the page header's 
> min and max values can also be this large.


