[
https://issues.apache.org/jira/browse/PARQUET-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813652#comment-17813652
]
Ence Wang commented on PARQUET-2424:
------------------------------------
[~gershinsky] In my case, 64K is still not sufficient: there are 102K pages in
a single column chunk.
{code:java}
# reproduce.zip/test-input.parquet
row group 0
--------------------------------------------------------------------------------
task_log: BINARY UNCOMPRESSED DO:4 FPO:30669 SZ:3399502/3399502/1.00
[more]... ST:[no stats for this column] task_log TV=10208503 RL=0 DL=1 DS: 2
DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 1: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 2: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 3: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
....
page 102080: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 102081: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 102082: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 102083: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 102084: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:100
page 102085: DLE:RLE RLE:BIT_PACKED VLE:PLA
[more]... CRC:[verified] VC:3 {code}
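The page count can be cross-checked from the dump itself. The following is a minimal sketch, not parquet-mr code: the class and method names are hypothetical, and it assumes "64K" means the 65536 ordinals an unsigned 16-bit value could address. It takes the total value count (TV=10208503) and the per-page value count (VC:100) from the output above and compares the resulting page count with both limits.

```java
public class PageCountCheck {
    // Limit reported in the error message: the largest value a Java short can hold.
    static final long SIGNED_16_BIT_LIMIT = Short.MAX_VALUE;   // 32767
    // Assumed meaning of "64K": the range of an unsigned 16-bit ordinal.
    static final long UNSIGNED_16_BIT_LIMIT = 1L << 16;        // 65536

    /** Pages needed when every page except possibly the last holds valuesPerPage values. */
    static long pageCount(long totalValues, long valuesPerPage) {
        return (totalValues + valuesPerPage - 1) / valuesPerPage; // ceiling division
    }

    public static void main(String[] args) {
        // TV and VC are taken from the dump of test-input.parquet above.
        long pages = pageCount(10_208_503L, 100L);
        System.out.println("pages = " + pages);
        System.out.println("fits signed 16-bit limit:   " + (pages <= SIGNED_16_BIT_LIMIT));
        System.out.println("fits unsigned 16-bit limit: " + (pages <= UNSIGNED_16_BIT_LIMIT));
    }
}
```

The result, 102086 pages, agrees with the dump (pages 0 through 102085, the last one holding 3 values) and exceeds both limits.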
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> -------------------------------------------------------------------------
>
> Key: PARQUET-2424
> URL: https://issues.apache.org/jira/browse/PARQUET-2424
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.13.1
> Reporter: Ence Wang
> Priority: Major
> Attachments: reproduce.zip
>
>
> When we were writing an encrypted file, we encountered the following error:
> {code:java}
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> {code}
>
> *Error Stack:*
> {code:java}
> org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet
> files can't have more than 32767 pages per chunk: 32768
> at
> org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
> at
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
> at
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
> at
> org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
> at
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295){code}
>
> *Reasons:*
> The `getBufferedSize` method of
> [FallbackValuesWriter|https://github.com/apache/parquet-mr/blob/19f284355847696fa254c789ab93c42db9af5982/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L73]
> returns the raw data size to decide whether to flush the page,
> so the page actually written can be much smaller because of
> dictionary encoding. This prevents a page from becoming too big when fallback
> happens, but it can also produce too many pages in a single column chunk. On
> the other hand, the encryption module supports only up to 32767 pages per
> chunk, as we use a `Short` to store the page ordinal as part of the
> [AAD|https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix].
>
>
> *Reproduce:*
> *[^reproduce.zip]*
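To illustrate where the 32767 figure in the description comes from, here is a minimal sketch of narrowing a page ordinal to a Java `short`. It is not the code in `AesCipher`; the class name, method name, and exception message are hypothetical and chosen only for illustration.

```java
public class PageOrdinalLimit {
    /** Narrows a page ordinal to a short, rejecting values a short cannot represent. */
    static short toShortOrdinal(int pageOrdinal) {
        if (pageOrdinal < 0 || pageOrdinal > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "page ordinal out of range for a short: " + pageOrdinal);
        }
        return (short) pageOrdinal;
    }

    public static void main(String[] args) {
        // The largest ordinal that fits in a signed 16-bit value.
        System.out.println(toShortOrdinal(32767));
        // An unchecked narrowing cast silently wraps around to a negative value.
        System.out.println((short) 32768);
        // With the range check, the next ordinal is rejected instead of wrapping.
        try {
            toShortOrdinal(32768);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Without a range check, the cast would wrap 32768 to -32768, so some form of explicit guard is needed whenever an ordinal is stored in 16 signed bits.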