[
https://issues.apache.org/jira/browse/PARQUET-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465146#comment-17465146
]
Anthony Pessy commented on PARQUET-1911:
----------------------------------------
[~igor.berman] One of the use cases I have is long-term storage of HTML
payloads: the dataset contains columns such as `url, status_code, fetch_date,
document_html`, sorted by `url`.
I'm using predicate pushdown filters on `url` to quickly find the relevant
documents, and it works well. In this dataset I don't care about keeping
statistics for the `document_html` column, which was producing the OOM,
especially for websites where every page carried a payload of ~7-8 MB.
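For reference, the pushdown read side looks roughly like this -- a minimal
sketch against the stock parquet-mr filter API, with placeholder path and URL
(it is the `url` min/max stats that let whole row groups be skipped, which is
why only the `document_html` stats are worth dropping):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.filter2.compat.FilterCompat;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.io.api.Binary;

    import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
    import static org.apache.parquet.filter2.predicate.FilterApi.eq;

    public class FindDocument {
        public static void main(String[] args) throws Exception {
            // Push the url predicate down so row groups whose url min/max
            // range excludes the target are skipped entirely.
            FilterPredicate onUrl =
                eq(binaryColumn("url"), Binary.fromString("https://example.com/page"));
            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(
                             new Path("/data/html/part-00000.parquet"))  // placeholder path
                         .withFilter(FilterCompat.get(onUrl))
                         .build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    System.out.println(record.get("status_code"));
                }
            }
        }
    }
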
I also use this feature in many intermediate Parquet files I write with Spark
and/or Hadoop jobs that contain similar data, and it has greatly reduced the
memory requirements.
I have attached the new `NoOpStatistics` file.
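The idea behind that file, reduced to a self-contained sketch -- note this is
NOT the real org.apache.parquet.column.statistics.Statistics API, whose
abstract method set varies across parquet-mr versions; the interface below is
invented purely to illustrate the no-op behaviour:

    /** Invented stand-in for the statistics contract, for illustration only. */
    interface BinaryColumnStatistics {
        void updateStats(byte[] value);
        boolean hasNonNullValue();
        byte[] getMinBytes();
        byte[] getMaxBytes();
    }

    /** Accepts every value but retains nothing, so no multi-MB min/max
     *  copies accumulate per row group. */
    final class NoOpStatistics implements BinaryColumnStatistics {
        @Override public void updateStats(byte[] value) { /* intentionally empty */ }
        @Override public boolean hasNonNullValue() { return false; } // nothing tracked
        @Override public byte[] getMinBytes() { return new byte[0]; }
        @Override public byte[] getMaxBytes() { return new byte[0]; }
    }
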
> Add way to disable statistics on a per-column basis
> ----------------------------------------------------
>
> Key: PARQUET-1911
> URL: https://issues.apache.org/jira/browse/PARQUET-1911
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Anthony Pessy
> Priority: Major
> Attachments: NoOpStatistics.java,
> add_config_to_opt-out_of_a_column's_statistics.patch
>
>
> When you write a dataset with BINARY columns that can be fairly large (several
> MBs), you can often end up with an OutOfMemory error, where you either have to:
>
> - Throw more RAM
> - Increase number of output files
> - Play with the block size
>
> Using a fork with an increased check frequency for the row group size helps,
> but it is not enough. (PR: [https://github.com/apache/parquet-mr/pull/470])
>
>
> The OutOfMemory error is then caused by the accumulation of min/max values
> for those columns in each BlockMetaData.
>
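> A back-of-the-envelope illustration (all numbers assumed from the ~7-8 MB
> HTML use case above, not measured):
>
>     public class StatsFootprint {
>         public static void main(String[] args) {
>             // min + max copies of a large binary value are retained per row group:
>             long valueSize = 7L * 1024 * 1024;  // ~7 MB payload (assumed)
>             long perRowGroup = 2 * valueSize;   // one min + one max per BlockMetaData
>             long rowGroups = 200;               // assumed row-group count in one write
>             System.out.println((perRowGroup * rowGroups) / (1024 * 1024)
>                 + " MB retained for stats alone");  // => 2800 MB
>         }
>     }
>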
> The "parquet.statistics.truncate.length" configuration is of no help because
> it is applied during the footer serialization whereas the OOM occurs before
> that.
>
> I think it would be nice to have, as for dictionary encoding or bloom filters,
> a way to disable statistics on a per-column basis.
>
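> A hedged sketch of what that could look like: the existing dictionary and
> bloom filter switches accept a "#column.path" suffix on the config key; the
> statistics key below is hypothetical, mirroring the attached patch, not a
> released option:
>
>     import org.apache.hadoop.conf.Configuration;
>
>     public class DisableHtmlStats {
>         public static void main(String[] args) {
>             Configuration conf = new Configuration();
>             // Existing per-column toggles use a "#column.path" suffix:
>             conf.setBoolean("parquet.enable.dictionary#document_html", false);
>             conf.setBoolean("parquet.bloom.filter.enabled#document_html", false);
>             // Hypothetical key in the same style -- what this issue asks for:
>             conf.setBoolean("parquet.column.statistics.enabled#document_html", false);
>         }
>     }
>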
> This could be very useful to lower memory consumption when the stats of a
> huge binary column are unnecessary.
>
>
>