[ https://issues.apache.org/jira/browse/PARQUET-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anthony Pessy updated PARQUET-1911:
-----------------------------------
    Description:
When you write a dataset with BINARY columns that can be fairly large (several MBs), you can often end up with an OutOfMemory error, leaving you to either:

 - Throw more RAM at the problem
 - Increase the number of output files
 - Play with the block size

Using a fork that checks the row group size more frequently helps, but it is not enough. (PR: [https://github.com/apache/parquet-mr/pull/470])

The OutOfMemory error is then caused by the accumulation of min/max values for those columns in each BlockMetaData.

The "parquet.statistics.truncate.length" configuration is of no help because it is applied during footer serialization, whereas the OOM occurs before that.

I think it would be nice to have, as with dictionary encoding or bloom filters, a way to disable statistics on a per-column basis. This could be very useful for lowering memory consumption when the statistics of a huge binary column are unnecessary.

> Add a way to disable statistics on a per-column basis
> ------------------------------------------------------
>
>                 Key: PARQUET-1911
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1911
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Anthony Pessy
>            Priority: Major


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
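For illustration, a minimal sketch of how the requested toggle could look, assuming a hypothetical "parquet.column.statistics.enabled#column.path" configuration key modeled on the per-column "key#column.path" convention parquet-mr already uses for settings such as bloom filters. The column name "payload" is likewise made up for the example; this is a proposal sketch, not an existing API.

    import org.apache.hadoop.conf.Configuration;

    public class DisablePerColumnStats {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Existing knob (real, cited in the description): truncates min/max
            // values, but only during footer serialization, so it does not stop
            // the per-BlockMetaData accumulation that triggers the OOM.
            conf.setInt("parquet.statistics.truncate.length", 64);

            // Proposed knob (hypothetical, does not exist in parquet-mr today):
            // skip min/max collection entirely for the large BINARY column
            // "payload", so no statistics are held in memory for it.
            conf.setBoolean("parquet.column.statistics.enabled#payload", false);

            // The Configuration would then be passed to the writer, e.g. via
            // ParquetWriter.Builder#withConf(conf).
        }
    }

Following the existing dictionary/bloom-filter precedent would keep the default behavior (statistics enabled) unchanged while letting writers opt out for individual columns.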