[
https://issues.apache.org/jira/browse/PARQUET-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Szadovszky resolved PARQUET-1920.
---------------------------------------
Resolution: Fixed
> Fix issue with reading parquet files with too large column chunks
> -----------------------------------------------------------------
>
> Key: PARQUET-1920
> URL: https://issues.apache.org/jira/browse/PARQUET-1920
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.0, 1.11.0, 1.10.1, 1.12.0, 1.11.1
> Reporter: Ashish Singh
> Assignee: Ashish Singh
> Priority: Major
>
> Fix Parquet writer's memory check while writing highly skewed data.
> Parquet uses {{CapacityByteArrayOutputStream}} to hold column chunks in
> memory. It is similar to {{ByteArrayOutputStream}}, but it avoids copying the
> entire data while growing the array by creating and maintaining separate
> arrays (slabs). Slab sizes grow exponentially until the accumulated capacity
> nears the configurable max capacity hint, after which they grow very slowly.
> Combined with Parquet's logic for deciding when to check whether enough data
> is buffered in memory to flush to disk, this makes it possible for a highly
> skewed dataset to cause Parquet to write a very large column chunk, and
> therefore row group, beyond the maximum expected row group size (an int).
> In Parquet 1.10, a change was made to make the page size row check frequency
> configurable. However, a bug in that implementation prevents these configs
> from being applied to the memory check calculation.
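> For reference, a hedged example of setting those row-check frequency knobs
> through a Hadoop {{Configuration}} (property names as defined in parquet-mr's
> {{ParquetOutputFormat}}; confirm them against your version):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class PageSizeCheckConfig {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Do not check the buffered page size before this many records...
>     conf.setInt("parquet.page.size.row.check.min", 100);
>     // ...and check it no later than this many records.
>     conf.setInt("parquet.page.size.row.check.max", 10000);
>     // Pass 'conf' to the Parquet writer / output format as usual.
>   }
> }
> {code}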
--
This message was sent by Atlassian Jira
(v8.3.4#803005)