Piyush Narang created PARQUET-624:
-------------------------------------
Summary: Value count used for memSize calculation in
ColumnWriterV1 can be skewed based on first 100 values
Key: PARQUET-624
URL: https://issues.apache.org/jira/browse/PARQUET-624
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Reporter: Piyush Narang
Assignee: Piyush Narang
While digging into some OOMs that we were seeing for some of our Parquet writer
jobs, I noticed that we were writing out around 250MB+ of data for a single
column as one page. Our page size threshold is set to 1MB so this should
actually result in a few hundred pages instead of just 1.
This seems to be due to the code in:
[ColumnWriterV1.accountForValueWritten()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L93].
We only check if we've crossed the memory threshold if the valueCount exceeds
the valueCountForNextSizeCheck. However, valueCountForNextSizeCheck can end up
getting skewed substantially if the memSize of the first 100 values of the
column is really small:
For example, I see this in one of our jobs:
{code}
[foo_column] valueCount: 101, memSize: 16, pageSizeThreshold: 1048576
valueCountForNextSizeCheck = (int)(valueCount + ((float)valueCount *
props.getPageSizeThreshold() / memSize)) / 2 + 1;
[foo_column] valueCountForNextSizeCheck = 3309619
{code}
This really large new valueCountForNextSizeCheck, results in our job OOMing as
we end up seeing more space consuming values much much earlier than the ~3M
valueCount point.
At this point, I'm thinking of doing something simple which is similar to
[InternalParquetRecordWriter.checkBlockSizeReached()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L143],
basically cap the maximum value of the valueCountForNextSizeCheck:
{code}
valueCountForNextSizeCheck =
Math.min(
(int)(valueCount + ((float)valueCount * pageSizeThreshold /
memSize)) / 2 + 1,
valueCount + MAX_COUNT_FOR_SIZE_CHECK // will not look more than
max records ahead
);
{code}
Open to something more sophisticated if people prefer so.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)