[ https://issues.apache.org/jira/browse/PARQUET-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anthony Pessy updated PARQUET-1911:
-----------------------------------
    Description:
When you write a dataset with BINARY columns that can be fairly large (several MBs), you can often end up with an OutOfMemory error, leaving you to either:

 - Throw more RAM at the problem
 - Increase the number of output files
 - Play with the block size

Using a fork that checks the row group size more frequently helps, but it is not enough. (PR: [https://github.com/apache/parquet-mr/pull/470])

The OutOfMemory error is then caused by the accumulation of min/max values for those columns in each BlockMetaData.

The "parquet.statistics.truncate.length" configuration is of no help because it is applied during footer serialization, whereas the OOM occurs before that.

I think it would be nice to have, as with dictionary encoding or bloom filters, a way to disable statistics on a per-column basis. This could be very useful for lowering memory consumption when the statistics of a huge binary column are unnecessary.

> Add a way to disable statistics on a per-column basis
> ------------------------------------------------------
>
>                 Key: PARQUET-1911
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1911
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Anthony Pessy
>            Priority: Major


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
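For illustration, a minimal sketch of how the requested toggle could look, assuming a hypothetical "parquet.column.statistics.enabled#column.path" configuration key modeled on the per-column "key#column.path" convention parquet-mr already uses for settings such as bloom filters. The column name "payload" is likewise made up for the example; this is a proposal sketch, not an existing API.

    import org.apache.hadoop.conf.Configuration;

    public class DisablePerColumnStats {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Existing knob (real, cited in the description): truncates min/max
            // values, but only during footer serialization, so it does not stop
            // the per-BlockMetaData accumulation that triggers the OOM.
            conf.setInt("parquet.statistics.truncate.length", 64);

            // Proposed knob (hypothetical, does not exist in parquet-mr today):
            // skip min/max collection entirely for the large BINARY column
            // "payload", so no statistics are held in memory for it.
            conf.setBoolean("parquet.column.statistics.enabled#payload", false);

            // The Configuration would then be passed to the writer, e.g. via
            // ParquetWriter.Builder#withConf(conf).
        }
    }

Following the existing dictionary/bloom-filter precedent would keep the default behavior (statistics enabled) unchanged while letting writers opt out for individual columns.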