[
https://issues.apache.org/jira/browse/PARQUET-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814074#comment-17814074
]
Ence Wang commented on PARQUET-2424:
------------------------------------
Yes, I increased the `parquet.page.size` from 1M to 10M for this case and the
error is gone.
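For context, that change corresponds to setting the writer property below (value in bytes; 1M is the parquet-mr default):

{code}
# hypothetical property-file form; set this wherever the writer's
# Hadoop/Parquet configuration is built
parquet.page.size=10485760
{code}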
Additionally, I think `FallbackValuesWriter` is flawed to some extent: it
makes the pages too small just to prevent them from being too big when a
fallback happens.
If we temporarily ignore the fallback concern and change
`FallbackValuesWriter::getBufferedSize` to return the actual encoded size, the
resulting file was measured to have larger pages and a much smaller file size
(192K, compared to 4.5M currently).
But if we take fallback into account, it is difficult to handle both sides
under the current writing framework.
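To make the trade-off concrete, here is a toy model (not the real parquet-mr code; the 4K encoded page size is a made-up illustration): pages are flushed once the *raw* buffered size reaches `parquet.page.size`, while the row group keeps filling based on the far smaller encoded bytes actually written, so the page count per chunk can overflow:

{code:java}
public class PagesPerChunkModel {
    public static void main(String[] args) {
        long pageSize        = 1L << 20;   // parquet.page.size: flush at 1M *raw* bytes
        long blockSize       = 128L << 20; // parquet.block.size: 128M per row group
        long encodedPageSize = 4L << 10;   // hypothetical: 1M raw dict-encodes to ~4K

        // Each flush is triggered by the raw size but emits only ~4K, so the
        // chunk keeps accepting pages until the encoded bytes fill the block.
        long pagesPerChunk = blockSize / encodedPageSize;
        System.out.println(pagesPerChunk); // 32768, one past the 32767 AAD limit
    }
}
{code}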
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> -------------------------------------------------------------------------
>
> Key: PARQUET-2424
> URL: https://issues.apache.org/jira/browse/PARQUET-2424
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.13.1
> Reporter: Ence Wang
> Priority: Major
> Attachments: image-2024-02-04-19-21-41-207.png, reproduce.zip
>
>
> When we were writing an encrypted file, we encountered the following error:
> {code:java}
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> {code}
>
> *Error Stack:*
> {code:java}
> org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet
> files can't have more than 32767 pages per chunk: 32768
> at
> org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
> at
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
> at
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
> at
> org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
> at
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295){code}
>
> *Reasons:*
> The `getBufferedSize` method of
> [FallbackValuesWriter|https://github.com/apache/parquet-mr/blob/19f284355847696fa254c789ab93c42db9af5982/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L73]
> returns the raw data size when deciding whether to flush the page,
> so the actual size of the written page can be much smaller due to
> dictionary encoding. This prevents pages from being too big when a fallback
> happens, but it can also produce too many pages in a single column chunk. On
> the other hand, the encryption module only supports up to 32767 pages per
> chunk, because we use a `Short` to store the page ordinal as part of the
> [AAD|https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix].
>
>
> *Reproduce:*
> *[^reproduce.zip]*
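For reference, the 32767 cap quoted above comes from the page ordinal being written as a 2-byte little-endian short into the AAD suffix. Below is a simplified sketch of that check, modeled loosely on `AesCipher.quickUpdatePageAAD`; the buffer layout is illustrative, not the exact AAD suffix format:

{code:java}
public class PageOrdinalSketch {
    // Simplified stand-in for AesCipher.quickUpdatePageAAD: the new page
    // ordinal must fit into a 2-byte little-endian short at the end of the AAD.
    static void quickUpdatePageAad(byte[] pageAad, int newPageOrdinal) {
        if (newPageOrdinal > Short.MAX_VALUE) { // 32767
            throw new RuntimeException("Encrypted parquet files can't have more than "
                + Short.MAX_VALUE + " pages per chunk: " + newPageOrdinal);
        }
        pageAad[pageAad.length - 2] = (byte) (newPageOrdinal & 0xFF);
        pageAad[pageAad.length - 1] = (byte) ((newPageOrdinal >> 8) & 0xFF);
    }

    public static void main(String[] args) {
        byte[] aad = new byte[10];      // illustrative AAD buffer
        quickUpdatePageAad(aad, 32767); // ok: fits in a short
        quickUpdatePageAad(aad, 32768); // throws, reproducing the reported error
    }
}
{code}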
--
This message was sent by Atlassian Jira
(v8.20.10#820010)