[
https://issues.apache.org/jira/browse/ARROW-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582181#comment-17582181
]
Antoine Pitrou commented on ARROW-17465:
----------------------------------------
Hmm, it seems it will be complicated for Parquet C++ to accept this without
some re-architecting of the decoder.
(roughly, if we decode Int32 values, we first decode the deltas as a buffer of
Int32 as well...)
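To make the parenthetical above concrete, here is a minimal sketch (plain
Python, not the Parquet C++ decoder) that computes the exact deltas for the
values from the issue description quoted below and checks which of them fit
in a signed 32-bit integer. The names `fits_int32` and `overflowing` are
illustrative only and do not come from the Arrow code base.

```python
# Values taken from the issue description quoted below.
values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits_int32(x):
    """Return True if x is representable as a signed 32-bit integer."""
    return INT32_MIN <= x <= INT32_MAX


# Python integers are arbitrary precision, so these deltas are exact
# (no wrap-around as there would be with a fixed-width int32 buffer).
deltas = [b - a for a, b in zip(values, values[1:])]

# Deltas that an Int32 delta buffer could not hold without overflow.
overflowing = [d for d in deltas if not fits_int32(d)]

print(deltas)
print(overflowing)
```

Every input value fits in Int32, yet some of the exact deltas do not, which
is why a decoder that stages deltas in a buffer of the value type cannot
represent them as-is.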
I would suggest first asking on the Parquet ML whether this is intended to be
supported.
(note that of course for such data, DELTA_BINARY_PACKED should not be used at
all, as it produces an expansion... while being more CPU-intensive to decode as
well)
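As a rough illustration of the expansion mentioned above, the sketch below
(plain Python, assuming exact arithmetic without wrap-around, and ignoring
block headers and miniblock padding) computes the bit width needed for the
deltas in the issue's example and compares the packed payload size with
storing the same number of values as plain 32-bit integers. Variable names
such as `packed_bits` and `plain_bits` are illustrative only.

```python
# Values taken from the issue description quoted below.
values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]

# Exact deltas between consecutive values.
deltas = [b - a for a, b in zip(values, values[1:])]

# DELTA_BINARY_PACKED stores each delta relative to the block's minimum
# delta, so the stored numbers are non-negative.
min_delta = min(deltas)
adjusted = [d - min_delta for d in deltas]

# Bits needed to represent the largest adjusted delta.
bit_width = max(adjusted).bit_length()

# Payload size for the deltas only (headers and padding ignored).
packed_bits = bit_width * len(deltas)
plain_bits = 32 * len(deltas)

print(bit_width, packed_bits, plain_bits)
```

Under these assumptions the packed deltas need more bits per value than the
32-bit values they encode, before even counting the encoding's headers.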
> [Parquet] DELTA_BINARY_PACKED constraint on num_bits is too restrict?
> ---------------------------------------------------------------------
>
> Key: ARROW-17465
> URL: https://issues.apache.org/jira/browse/ARROW-17465
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Reporter: Jorge Leitão
> Priority: Major
> Attachments: test.parquet
>
>
> Consider the sequence of (int32) values
> [863490391,-816295192,1613070492,-1166045478,1856530847]
> This sequence can be encoded as a single block, single miniblock with a
> bit_width of 33.
> However, we currently require [1] the bit_width of each miniblock to be
> smaller than the bitwidth of the type it encodes.
> We could consider lifting this constraint since, as the example above shows,
> the `bit_width` of the values' representation can be smaller than the
> `bit_width` of the deltas' representation.
> [1]
> https://github.com/apache/arrow/blob/a376968089d7310f4a88d054822fa1eaf96c46f5/cpp/src/parquet/encoding.cc#L2173