[
https://issues.apache.org/jira/browse/PARQUET-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209904#comment-17209904
]
ASF GitHub Bot commented on PARQUET-1920:
-----------------------------------------
SinghAsDev commented on pull request #824:
URL: https://github.com/apache/parquet-mr/pull/824#issuecomment-705226533
Hey @gszadovszky @shangxinli what do you guys think of this?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Fix issue with reading parquet files with too large column chunks
> -----------------------------------------------------------------
>
> Key: PARQUET-1920
> URL: https://issues.apache.org/jira/browse/PARQUET-1920
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.0, 1.11.0, 1.10.1, 1.12.0, 1.11.1
> Reporter: Ashish Singh
> Assignee: Ashish Singh
> Priority: Major
>
> Fix Parquet writer's memory check while writing highly skewed data.
> Parquet uses {{CapacityByteArrayOutputStream}} to hold column chunks in
> memory. This is similar to {{ByteArrayOutputStream}}, but it avoids
> copying the entire data while growing the array. It does so by creating and
> maintaining separate arrays (slabs). Slab sizes grow exponentially until
> the total nears the configurable max capacity hint, after which they grow
> very slowly. This, combined with Parquet's logic for deciding when to check
> whether enough data is in memory to flush to disk, makes it possible for a
> highly skewed dataset to make Parquet write a really large column chunk,
> and thus row group, beyond the maximum expected size (an int) of the row group.
> In Parquet 1.10, a change was made to make the page-size row-check
> frequency configurable. However, a bug in the implementation prevents
> these configs from taking effect in the memory-check calculation.
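For context, the slab-growth pattern described above can be sketched roughly as follows. This is an illustrative approximation only, not the actual parquet-mr `CapacityByteArrayOutputStream` code; the initial slab size, doubling factor, and post-hint growth fraction here are made-up parameters chosen to show the shape of the behavior (fast exponential growth up to the hint, then slow fixed-size growth past it):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (NOT the real parquet-mr implementation) of
// slab-based buffer growth: slab sizes double until the total allocation
// nears a max capacity hint, after which each new slab is a small fixed
// fraction of the hint, so growth past the hint is very slow.
class SlabGrowthSketch {
    static List<Integer> slabSizes(int initialSlab, int maxCapacityHint, int totalToWrite) {
        List<Integer> slabs = new ArrayList<>();
        int allocated = 0;
        int next = initialSlab;
        while (allocated < totalToWrite) {
            slabs.add(next);
            allocated += next;
            if (allocated < maxCapacityHint) {
                // Exponential phase: double the slab, but never overshoot the hint.
                next = Math.min(next * 2, maxCapacityHint - allocated);
            } else {
                // Slow phase: small fixed-size slabs once the hint is reached.
                next = Math.max(maxCapacityHint / 5, 1);
            }
        }
        return slabs;
    }

    public static void main(String[] args) {
        // Skewed data keeps writing past the hint: many small slabs accumulate,
        // so total buffered memory can drift well beyond the hint before a flush.
        System.out.println(slabSizes(1024, 64 * 1024, 200 * 1024));
    }
}
```

The point of the sketch is that the hint only slows growth rather than capping it, so if the flush check does not fire often enough, the buffered column chunk can keep growing far beyond the intended row-group size.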
--
This message was sent by Atlassian Jira
(v8.3.4#803005)