tachyonwill commented on a change in pull request #12274:
URL: https://github.com/apache/arrow/pull/12274#discussion_r793135042
##########
File path: cpp/src/parquet/encoding.cc
##########
@@ -1486,7 +1486,7 @@ class DictDecoderImpl : public DecoderImpl, virtual
public DictDecoder<Type> {
return;
}
uint8_t bit_width = *data;
- if (ARROW_PREDICT_FALSE(bit_width >= 64)) {
+ if (ARROW_PREDICT_FALSE(bit_width > 32)) {
throw ParquetException("Invalid or corrupted bit_width");
Review comment:
I think this restriction is somewhat separate from the dictionary-specific
bit width restriction. The dictionary bit width restriction has been in the
format since at least version 2.2 in 2013:
https://github.com/apache/parquet-format/commit/ad2e4c438cdf080bf042a5330965e2eefb0caa65
A bit width greater than 32 would also be incompatible with the num_values
field in the page header:
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L544
Also, parquet-cpp uses int32_t internally for indices, so supporting higher
bit widths would require a refactor (e.g.
https://github.com/apache/arrow/blob/01855c791056b7f712e6df82acf97ad3ab7b823a/cpp/src/parquet/encoding.cc#L1582
)
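
For reference, here is a minimal standalone sketch of the kind of check being
discussed; the function name and exception type are illustrative, not the
actual parquet-cpp symbols (the real check lives in DictDecoderImpl and throws
ParquetException, as shown in the diff above):

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative sketch only: in a dictionary-encoded data page, the first
// byte is the RLE/bit-packed bit width of the dictionary indices that follow.
void ValidateDictionaryIndexBitWidth(const uint8_t* data, int len) {
  if (len == 0) {
    return;  // empty page: nothing to decode
  }
  uint8_t bit_width = *data;
  // Indices are materialized into int32_t buffers, and the page header's
  // num_values field is an i32, so a width above 32 bits cannot describe a
  // valid dictionary index and indicates a corrupted page.
  if (bit_width > 32) {
    throw std::runtime_error("Invalid or corrupted bit_width");
  }
}
```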