tachyonwill commented on a change in pull request #12274:
URL: https://github.com/apache/arrow/pull/12274#discussion_r793135042
##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -1486,7 +1486,7 @@ class DictDecoderImpl : public DecoderImpl, virtual
public DictDecoder<Type> {
return;
}
uint8_t bit_width = *data;
- if (ARROW_PREDICT_FALSE(bit_width >= 64)) {
+ if (ARROW_PREDICT_FALSE(bit_width > 32)) {
throw ParquetException("Invalid or corrupted bit_width");
Review comment:
I think this restriction is somewhat separate from the dictionary-specific
bit width restriction. The dictionary bit width restriction has been in the
format since at least version 2.2 in 2013:
https://github.com/apache/parquet-format/commit/ad2e4c438cdf080bf042a5330965e2eefb0caa65
A bit width greater than 32 would also be incompatible with the num_values
field in the page header:
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L544
Also, parquet-cpp uses int32_t internally for indices, so supporting higher
bit widths would require a refactor (e.g.
https://github.com/apache/arrow/blob/01855c791056b7f712e6df82acf97ad3ab7b823a/cpp/src/parquet/encoding.cc#L1582
)
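
For reference, here is a minimal standalone sketch of the kind of check being
discussed; the function name and exception type are illustrative, not the
actual parquet-cpp symbols (the real check lives in DictDecoderImpl and throws
ParquetException, as shown in the diff above):

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative sketch only: in a dictionary-encoded data page, the first
// byte is the RLE/bit-packed bit width of the dictionary indices that follow.
void ValidateDictionaryIndexBitWidth(const uint8_t* data, int len) {
  if (len == 0) {
    return;  // empty page: nothing to decode
  }
  uint8_t bit_width = *data;
  // Indices are materialized into int32_t buffers, and the page header's
  // num_values field is an i32, so a width above 32 bits cannot describe a
  // valid dictionary index and indicates a corrupted page.
  if (bit_width > 32) {
    throw std::runtime_error("Invalid or corrupted bit_width");
  }
}
```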