[
https://issues.apache.org/jira/browse/PARQUET-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815601#comment-17815601
]
Ence Wang commented on PARQUET-2424:
------------------------------------
If we use `min(dict_encoded_size, fallback_to_plain_encoded_size)` for each
page limit check, it works fine as long as no fallback happens.
But if the fallback does happen, it brings a risk of OOM, because
the dictionary-encoded values are re-encoded as plain and the in-memory
buffer might expand significantly. That's why the current design chooses to
overestimate the page size: it is a preventive strategy to avoid OOM when
fallback happens.
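To make the trade-off concrete, here is a rough sketch of how the choice of size estimate changes the number of pages in a chunk. It is not parquet-mr code: the class name, the method names, and all the numbers (value count, bytes per value, a 1 MiB page size threshold) are assumptions chosen for illustration, and it assumes a page is flushed each time the estimated buffered size reaches the threshold.

```java
public class PageCountSketch {
    // The page ordinal is stored as a Short, so a chunk can hold at most 32767 pages.
    static final long MAX_PAGES_PER_CHUNK = Short.MAX_VALUE;

    // Pages written for a chunk if a page is flushed whenever the estimated
    // buffered size reaches pageSizeThreshold (ceiling division).
    static long pagesPerChunk(long valueCount, long estimatedBytesPerValue, long pageSizeThreshold) {
        long totalEstimatedBytes = valueCount * estimatedBytesPerValue;
        return (totalEstimatedBytes + pageSizeThreshold - 1) / pageSizeThreshold;
    }

    static boolean exceedsEncryptionLimit(long pages) {
        return pages > MAX_PAGES_PER_CHUNK;
    }

    public static void main(String[] args) {
        // Hypothetical low-cardinality column: large raw values, tiny dictionary ids.
        long valueCount = 40_000_000L;
        long plainBytesPerValue = 1_000L; // assumed raw (plain-encoded) size per value
        long dictBytesPerValue = 1L;      // assumed dictionary-encoded size per value
        long pageSizeThreshold = 1L << 20; // assumed 1 MiB page size limit

        // Current behaviour: the limit check uses the raw (plain) size.
        long pagesWithRawEstimate =
                pagesPerChunk(valueCount, plainBytesPerValue, pageSizeThreshold);

        // Proposed alternative: the limit check uses min(dict, plain).
        long pagesWithMinEstimate = pagesPerChunk(
                valueCount, Math.min(dictBytesPerValue, plainBytesPerValue), pageSizeThreshold);

        System.out.println("raw estimate: " + pagesWithRawEstimate + " pages, over limit: "
                + exceedsEncryptionLimit(pagesWithRawEstimate));
        System.out.println("min estimate: " + pagesWithMinEstimate + " pages, over limit: "
                + exceedsEncryptionLimit(pagesWithMinEstimate));
    }
}
```

With these assumed numbers the raw estimate yields more pages than the encryption module allows, while the `min` estimate stays far below the limit. The sketch does not model the other side of the trade-off, namely that each of those larger pages would grow by the plain/dict ratio in memory if a fallback re-encoded it.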
To solve this issue completely, I think we need to redesign the current
fallback mechanism so that it estimates the page size precisely without
the OOM risk.
In the meantime, I will try to find a quick fix that avoids this error
without requiring any action from users.
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> -------------------------------------------------------------------------
>
> Key: PARQUET-2424
> URL: https://issues.apache.org/jira/browse/PARQUET-2424
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.13.1
> Reporter: Ence Wang
> Priority: Major
> Attachments: image-2024-02-04-19-21-41-207.png, reproduce.zip
>
>
> When we were writing an encrypted file, we encountered the following error:
> {code:java}
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> {code}
>
> *Error Stack:*
> {code:java}
> org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet files can't have more than 32767 pages per chunk: 32768
>     at org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
>     at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
>     at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
>     at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
>     at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
>     at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
>     at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
>     at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295)
> {code}
>
> *Reasons:*
> The `getBufferedSize` method of
> [FallbackValuesWriter|https://github.com/apache/parquet-mr/blob/19f284355847696fa254c789ab93c42db9af5982/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L73]
> returns the raw data size to decide whether to flush the page,
> so the actual size of the written page can be much smaller due to
> dictionary encoding. This prevents pages from being too big when fallback happens,
> but it can also produce too many pages in a single column chunk. On the other
> hand, the encryption module only supports up to 32767 pages per chunk, because
> the page ordinal is stored as a `Short` as part of the
> [AAD|https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix].
>
>
> *Reproduce:*
> *[^reproduce.zip]*