Ashish Singh created PARQUET-1920:
-------------------------------------
Summary: Fix issue with reading parquet files with too large column chunks
Key: PARQUET-1920
URL: https://issues.apache.org/jira/browse/PARQUET-1920
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.10.1, 1.11.0, 1.10.0, 1.12.0, 1.11.1
Reporter: Ashish Singh
Assignee: Ashish Singh
Fix Parquet writer's memory check while writing highly skewed data.
Parquet uses {{CapacityByteArrayOutputStream}} to hold column chunks in
memory. It is similar to {{ByteArrayOutputStream}}, however it avoids
copying the entire data while growing the buffer. It does so by creating
and maintaining a list of separate arrays (slabs). Slab sizes grow
exponentially until the total capacity nears the configurable max capacity
hint, after which they grow very slowly. This, combined with Parquet's
logic for deciding when to check whether enough data is buffered in memory
to flush to disk, makes it possible for a highly skewed dataset to make
Parquet write a really large column chunk, and hence row group, beyond the
maximum expected row group size (which must fit in an int).
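For illustration only, a minimal sketch of this slab-based growth pattern
(simplified, with made-up names; not the actual
{{CapacityByteArrayOutputStream}} code): slab sizes double until the
accumulated capacity nears the max capacity hint, after which each new slab
stays small, so the buffer keeps accepting data without ever copying
earlier slabs.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT org.apache.parquet.bytes.CapacityByteArrayOutputStream.
// It shows the growth pattern described above: slab sizes double until the
// total capacity nears the max capacity hint, after which new slabs stay small.
public class SlabBufferSketch {
  private static final int SMALL_SLAB_SIZE = 1024; // hypothetical small-slab size

  private final List<byte[]> slabs = new ArrayList<>();
  private final int maxCapacityHint;
  private int nextSlabSize;
  private long totalCapacity = 0;

  public SlabBufferSketch(int initialSlabSize, int maxCapacityHint) {
    this.nextSlabSize = initialSlabSize;
    this.maxCapacityHint = maxCapacityHint;
  }

  /** Allocates the next slab; callers would write column chunk bytes into it. */
  public byte[] addSlab() {
    byte[] slab = new byte[nextSlabSize];
    slabs.add(slab);
    totalCapacity += nextSlabSize;
    if (totalCapacity < maxCapacityHint) {
      // Grow exponentially while total capacity is below the hint,
      // so growing never requires copying the earlier slabs.
      nextSlabSize = nextSlabSize * 2;
    } else {
      // Near or past the hint: keep accepting data, but grow very slowly.
      nextSlabSize = SMALL_SLAB_SIZE;
    }
    return slab;
  }

  public long capacity() {
    return totalCapacity;
  }
}
{code}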
In Parquet 1.10, a change was made to make the page size row check
frequency configurable. However, a bug in that implementation prevents
these configs from being applied in the memory check calculation.
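For context, a simplified sketch of how such a row-count-based size check
schedule is intended to behave (hypothetical names, not the parquet-mr
implementation): the check runs only every so many records, and the
interval is meant to be clamped between configurable min and max bounds so
that a run of unusually large (skewed) records cannot postpone the check
long enough for the buffered size to blow past the target.
{code:java}
// Illustrative sketch only (hypothetical names, not the parquet-mr code):
// a size check that runs only every few records, with the check interval
// clamped between configurable min and max row counts.
public class SizeCheckSketch {
  private final long targetBlockSize;     // desired row group size in bytes
  private final int minRowCheckInterval;  // lower bound on records between checks
  private final int maxRowCheckInterval;  // upper bound on records between checks

  private long recordCount = 0;
  private long nextSizeCheck;

  public SizeCheckSketch(long targetBlockSize, int minInterval, int maxInterval) {
    this.targetBlockSize = targetBlockSize;
    this.minRowCheckInterval = minInterval;
    this.maxRowCheckInterval = maxInterval;
    this.nextSizeCheck = minInterval;
  }

  // Called once per written record with the current buffered size in bytes.
  // Returns true when the caller should flush the row group.
  public boolean recordWritten(long bufferedSize) {
    recordCount++;
    if (recordCount < nextSizeCheck) {
      return false; // skip the (relatively expensive) size check for this record
    }
    if (bufferedSize >= targetBlockSize) {
      recordCount = 0;
      nextSizeCheck = minRowCheckInterval;
      return true;
    }
    // Estimate how many more records fit, then clamp the next check so a
    // burst of very large records cannot delay the check for too long.
    long avgRecordSize = bufferedSize / Math.max(recordCount, 1);
    long estimatedRemaining = (targetBlockSize - bufferedSize) / Math.max(avgRecordSize, 1);
    nextSizeCheck = recordCount + Math.min(
        Math.max(estimatedRemaining / 2, minRowCheckInterval),
        maxRowCheckInterval);
    return false;
  }
}
{code}
If the clamping step is skipped or miscomputed, the next check can be
scheduled far too late, which is the kind of failure mode this issue is
about.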
--
This message was sent by Atlassian Jira
(v8.3.4#803005)