Ashish Singh created PARQUET-1920:
-------------------------------------

             Summary: Fix issue with reading parquet files with too large column chunks
                 Key: PARQUET-1920
                 URL: https://issues.apache.org/jira/browse/PARQUET-1920
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.11.1, 1.12.0
            Reporter: Ashish Singh
            Assignee: Ashish Singh


Fix Parquet writer's memory check while writing highly skewed data.

Parquet uses {{CapacityByteArrayOutputStream}} to hold column chunks in 
memory. It is similar to {{ByteArrayOutputStream}}, but it avoids copying 
the entire data while growing the array by allocating and maintaining 
multiple backing arrays (slabs). Slab sizes grow exponentially until the 
total nears the configurable max capacity hint, and after that they grow 
very slowly. Combined with Parquet's logic for deciding when to check 
whether enough data has been buffered in memory to flush to disk, this 
makes it possible for a highly skewed dataset to cause Parquet to write a 
very large column chunk, and therefore row group, beyond the maximum 
expected row group size (an int).
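
The following is a minimal sketch (not the actual 
{{CapacityByteArrayOutputStream}} code) of that growth pattern, assuming a 
slab-size cap of one tenth of the max capacity hint; it only illustrates how 
appends that are not checked often enough can push the buffered total far 
past the hint.

{code:java}
// Minimal illustrative sketch only, not the real CapacityByteArrayOutputStream.
// Slab sizes double until they reach an assumed cap of maxCapacityHint / 10,
// mirroring the "exponential, then very slow" growth described above.
import java.util.ArrayList;
import java.util.List;

public class SlabGrowthSketch {
  public static void main(String[] args) {
    final long maxCapacityHint = 128L * 1024 * 1024; // assumed row-group-sized hint
    final long slabCap = maxCapacityHint / 10;       // assumed cap once growth slows
    long total = 0;
    long slab = 1024;
    List<Long> slabs = new ArrayList<>();
    // Keep appending slabs; a writer that checks buffered size too infrequently
    // keeps doing this long after `total` has passed the hint.
    for (int appends = 0; appends < 40; appends++) {
      slabs.add(slab);
      total += slab;
      slab = Math.min(slab * 2, slabCap);            // exponential, then flat growth
    }
    System.out.printf("slabs=%d totalBytes=%d (hint=%d)%n",
        slabs.size(), total, maxCapacityHint);
    // The total keeps climbing well past the hint; in the writer this is how a
    // skewed column chunk, and hence row group, can exceed the expected int limit.
  }
}
{code}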

In Parquet 1.10, a change was made to make the page size row check 
frequency configurable. However, a bug in the implementation prevents 
these configs from being taken into account in the memory check 
calculation.
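
For context, a hedged sketch of how this frequency is typically tuned is 
below; the property keys {{parquet.page.size.row.check.min}} and 
{{parquet.page.size.row.check.max}} are assumed here and should be verified 
against the parquet-mr version in use.

{code:java}
// Hedged sketch: tuning the page size row check frequency via Hadoop configuration.
// The property keys below are assumptions; check ParquetOutputFormat for exact names.
import org.apache.hadoop.conf.Configuration;

public class PageSizeRowCheckConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Ask the writer to check buffered sizes after at least 10 and at most 1000
    // records, so a highly skewed column cannot buffer unchecked for long stretches.
    conf.setInt("parquet.page.size.row.check.min", 10);
    conf.setInt("parquet.page.size.row.check.max", 1000);
    // Hand this Configuration to ParquetOutputFormat / the writer builder as usual;
    // per this issue, the values currently do not feed the memory check as intended.
  }
}
{code}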



