arthurpassos commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1224403768


##########
cpp/src/parquet/encoding.cc:
##########
@@ -1484,6 +1521,37 @@ class DictDecoderImpl : public DecoderImpl, virtual public DictDecoder<Type> {
   // Perform type-specific initialization
   void SetDict(TypedDecoder<Type>* dictionary) override;
 
+  template <typename T = Type,
+            typename = std::enable_if_t<std::is_same_v<T, ByteArrayType> ||
+                                        std::is_same_v<T, LargeByteArrayType>>>
+  void SetByteArrayDict(TypedDecoder<Type>* dictionary) {
+    DecodeDict(dictionary);
+
+    auto dict_values = reinterpret_cast<ByteArray*>(dictionary_->mutable_data());
+
+    int total_size = 0;

Review Comment:
   To be honest, I am not very familiar with this code base, so it's hard for me to reason about this.
   
   I am assuming this Dictionary structure is similar to a LowCardinality implementation, which uses offsets to avoid storing duplicated values. I am also assuming it has a direct connection with the DictionaryBuilders discussed in: https://github.com/apache/arrow/pull/35825#discussion_r1210569849. If that's correct, it seems to me that in a worst-case scenario where a LargeByteArray has no repeated values and its total size exceeds the `int32` limit, this will be problematic.
   
   I would really appreciate it if someone could shed some more light on this.
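
   To make the concern concrete, here is a minimal standalone sketch (not Arrow code; the lengths and variable names are hypothetical) of why accumulating byte-array lengths into a 32-bit `total_size`, as in the diff above, truncates once the combined payload crosses `INT32_MAX`, while a 64-bit accumulator does not:

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <iostream>
   #include <limits>
   #include <vector>

   int main() {
     // Hypothetical dictionary entry lengths: no repeats, and the combined
     // payload exceeds INT32_MAX (2147483647 bytes).
     const std::vector<uint32_t> lengths = {1500000000u, 1500000000u, 10u};

     // A 64-bit accumulator holds the true total.
     int64_t total_size = 0;
     for (uint32_t len : lengths) {
       total_size += len;
     }
     std::cout << "int64 total: " << total_size << "\n";  // prints 3000000010

     // Narrowing to int32 (what `int total_size = 0;` effectively does)
     // cannot represent this value, so the stored size is wrong.
     auto narrowed = static_cast<int32_t>(total_size);
     std::cout << "int32 narrowed: " << narrowed << "\n";

     assert(total_size > std::numeric_limits<int32_t>::max());
     assert(static_cast<int64_t>(narrowed) != total_size);
     return 0;
   }
   ```

   If offsets into the dictionary's value buffer are derived from such a total, any entry past the 2 GiB mark would be addressed incorrectly, which is why a LargeByteArray path would presumably need 64-bit sizes and offsets throughout.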



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
