[
https://issues.apache.org/jira/browse/PARQUET-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813715#comment-17813715
]
Gidon Gershinsky commented on PARQUET-2424:
-------------------------------------------
Yep, makes sense.
Encryption performance depends on the page size. It runs fastest with pages of
100KB or larger, at speeds of a few gigabytes per second. When a page is only a
few dozen bytes, encryption throughput drops to roughly 70 megabytes per
second - two orders of magnitude slower.
Additionally, there are size implications, unrelated to encryption. Each page
has a page header, which is also a few dozen bytes, plus some column index
metadata. So using very small pages basically means doubling the file size.
Supporting 100K+ pages per column chunk would require changing the Parquet
format specification and updating the code in multiple implementations. Not
impossible, but still challenging. Given the performance issues triggered by
using this many pages, I think a better course of action would be to
configure the workload to create larger / fewer pages.
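For example, in parquet-mr the page size and the per-page row count limit can be tuned through writer properties. A sketch of such a configuration (the values below are illustrative, not recommendations; defaults may differ by release):

```properties
# Target (maximum) page size in bytes - larger pages mean fewer pages per chunk
parquet.page.size=1048576
# Upper bound on rows per page; raising it also reduces the page count
parquet.page.row.count.limit=20000
```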
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> -------------------------------------------------------------------------
>
> Key: PARQUET-2424
> URL: https://issues.apache.org/jira/browse/PARQUET-2424
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.13.1
> Reporter: Ence Wang
> Priority: Major
> Attachments: reproduce.zip
>
>
> When we were writing an encrypted file, we encountered the following error:
> {code:java}
> Encrypted parquet files can't have more than 32767 pages per chunk: 32768
> {code}
>
> *Error Stack:*
> {code:java}
> org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet
> files can't have more than 32767 pages per chunk: 32768
> at
> org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
> at
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
> at
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
> at
> org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
> at
> org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
> at
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295){code}
>
> *Reasons:*
> The `getBufferedSize` method of
> [FallbackValuesWriter|https://github.com/apache/parquet-mr/blob/19f284355847696fa254c789ab93c42db9af5982/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L73]
> returns the raw data size to decide whether to flush the page, so the actual
> size of the written page can be much smaller due to dictionary encoding. This
> prevents the page from being too big when the fallback happens, but it can
> also produce too many pages in a single column chunk. On the other hand, the
> encryption module only supports up to 32767 pages per chunk, since a `Short`
> is used to store the page ordinal as part of the
> [AAD|https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix].
>
>
> *Reproduce:*
> *[^reproduce.zip]*
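The 32767 limit described in the issue follows directly from the 2-byte signed page ordinal in the AAD suffix. A minimal standalone illustration of the wraparound (not parquet-mr code, just the underlying arithmetic):

```java
public class PageOrdinalLimit {
    public static void main(String[] args) {
        // The AAD suffix stores the page ordinal as a 2-byte short,
        // so Short.MAX_VALUE (32767) is the last representable ordinal.
        int lastValid = Short.MAX_VALUE;          // 32767
        short wrapped = (short) (lastValid + 1);  // 32768 wraps to -32768
        System.out.println(lastValid + " " + wrapped);
    }
}
```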
--
This message was sent by Atlassian Jira
(v8.20.10#820010)