[ https://issues.apache.org/jira/browse/PARQUET-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1920.
---------------------------------------
    Resolution: Fixed

> Fix issue with reading parquet files with too large column chunks
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1920
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1920
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.10.0, 1.11.0, 1.10.1, 1.12.0, 1.11.1
>            Reporter: Ashish Singh
>            Assignee: Ashish Singh
>            Priority: Major
>
> Fix the Parquet writer's memory check while writing highly skewed data.
> Parquet uses {{CapacityByteArrayOutputStream}} to hold column chunks in
> memory. This is similar to {{ByteArrayOutputStream}}, but it avoids
> copying the entire data while growing the array. It does so by creating
> and maintaining several separate arrays (slabs). The capacity grows
> exponentially until it nears the configurable max capacity hint, and
> after that it grows very slowly. This, together with Parquet's logic for
> deciding when to check whether enough data is buffered in memory to
> flush to disk, makes it possible for a highly skewed dataset to cause
> Parquet to write a very large column chunk, and hence row group, beyond
> the maximum expected (int-sized) row group size.
> In Parquet 1.10, a change was made to make the page size row check
> frequency configurable. However, a bug in that implementation prevents
> these configs from affecting the memory check calculation. Both
> mechanisms are sketched below.
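
A minimal, illustrative Java sketch of the slab-growth pattern described
above. The class and field names (SlabbedBuffer, maxCapacityHint) are made
up for illustration and this is not the actual CapacityByteArrayOutputStream
code: the idea is that total capacity roughly doubles until it reaches the
max capacity hint, after which each new slab is only a small fraction of the
hint, so the buffer can still creep past the hint between size checks.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: mimics the growth pattern described in the report,
    // not the real CapacityByteArrayOutputStream implementation.
    class SlabbedBuffer {
      private final List<byte[]> slabs = new ArrayList<>();
      private final long maxCapacityHint;
      private long totalCapacity;

      SlabbedBuffer(int initialSlabSize, long maxCapacityHint) {
        this.maxCapacityHint = maxCapacityHint;
        addSlab(initialSlabSize);
      }

      private void addSlab(int size) {
        slabs.add(new byte[size]);   // new slab; existing data is never copied
        totalCapacity += size;
      }

      // Called when the current slab is full and more room is needed.
      void grow() {
        long next;
        if (totalCapacity < maxCapacityHint) {
          next = totalCapacity;                      // exponential phase: double total capacity
        } else {
          next = Math.max(maxCapacityHint / 5, 1);   // past the hint: grow very slowly
        }
        addSlab((int) Math.min(next, Integer.MAX_VALUE));
      }

      long capacity() { return totalCapacity; }

      public static void main(String[] args) {
        SlabbedBuffer buf = new SlabbedBuffer(1024, 1 << 20);  // 1 MiB capacity hint
        for (int i = 0; i < 20; i++) {
          buf.grow();
          System.out.println("total capacity after grow " + (i + 1) + ": " + buf.capacity());
        }
      }
    }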
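
And a similarly hedged sketch of the row-count-throttled size check the
report refers to (again with invented names; not the parquet-mr
implementation): the writer measures buffered memory only every so many
records and projects the next check point from the observed average record
size, clamped by the configurable minimum/maximum row counts. If that
projection is not actually driven by the configured values, or if skewed
data suddenly produces much larger records, the next check can come far too
late and the chunk overshoots the row group limit.

    // Illustrative sketch of row-count-throttled size checks; the min/max
    // row-count knobs mirror the configurable settings mentioned above, but
    // this is not the actual parquet-mr writer.
    class CheckThrottledWriter {
      private final long rowGroupSizeThreshold;
      private final long minRowCountForCheck;
      private final long maxRowCountForCheck;
      private long recordCount;
      private long recordCountForNextCheck;
      private long bufferedBytes;

      CheckThrottledWriter(long rowGroupSizeThreshold,
                           long minRowCountForCheck,
                           long maxRowCountForCheck) {
        this.rowGroupSizeThreshold = rowGroupSizeThreshold;
        this.minRowCountForCheck = minRowCountForCheck;
        this.maxRowCountForCheck = maxRowCountForCheck;
        this.recordCountForNextCheck = minRowCountForCheck;
      }

      void write(long recordSizeBytes) {
        bufferedBytes += recordSizeBytes;   // stand-in for buffering a real record
        recordCount++;
        if (recordCount >= recordCountForNextCheck) {
          checkBlockSize();
        }
      }

      private void checkBlockSize() {
        if (bufferedBytes >= rowGroupSizeThreshold) {
          flushRowGroup();
          return;
        }
        // Project how many more records fit before the threshold, assuming the
        // average record size stays the same, then clamp with the configured
        // bounds. If the projection is wrong (e.g. skewed data suddenly gets
        // much larger), the next check is deferred too long and the chunk
        // grows well past the row group limit.
        float avgRecordSize = (float) bufferedBytes / recordCount;
        long projected = (long) ((rowGroupSizeThreshold - bufferedBytes) / avgRecordSize);
        recordCountForNextCheck = Math.min(
            Math.max(minRowCountForCheck, recordCount + projected / 2),
            recordCount + maxRowCountForCheck);
      }

      private void flushRowGroup() {
        System.out.println("flush row group at " + bufferedBytes + " buffered bytes");
        bufferedBytes = 0;
        recordCount = 0;
        recordCountForNextCheck = minRowCountForCheck;
      }

      public static void main(String[] args) {
        // 128 MiB row group target, check between every 100 and 10,000 records.
        CheckThrottledWriter w = new CheckThrottledWriter(128L << 20, 100, 10_000);
        for (int i = 0; i < 1_000_000; i++) {
          // Skewed data: most records are tiny, an occasional one is huge.
          w.write(i % 10_000 == 0 ? 10 << 20 : 200);
        }
      }
    }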



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
