arthurpassos commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1224403768
##########
cpp/src/parquet/encoding.cc:
##########
@@ -1484,6 +1521,37 @@ class DictDecoderImpl : public DecoderImpl, virtual public DictDecoder<Type> {
   // Perform type-specific initiatialization
   void SetDict(TypedDecoder<Type>* dictionary) override;
+  template <typename T = Type,
+            typename = std::enable_if_t<std::is_same_v<T, ByteArrayType> ||
+                                        std::is_same_v<T, LargeByteArrayType>>>
+  void SetByteArrayDict(TypedDecoder<Type>* dictionary) {
+    DecodeDict(dictionary);
+
+    auto dict_values = reinterpret_cast<ByteArray*>(dictionary_->mutable_data());
+
+    int total_size = 0;
Review Comment:
Being honest, I am not very familiar with this code base, so it's hard for
me to reason about this.
I am assuming this Dictionary structure is similar to a LowCardinality
implementation, which uses offsets to avoid storing duplicated values. I am
also assuming it has a direct connection with the DictionaryBuilders
discussed in:
https://github.com/apache/arrow/pull/35825#discussion_r1210569849. If that's
correct, it seems to me that in a worst-case scenario where a LargeByteArray
dictionary has no repeated values and its total size exceeds the `int32`
limit, this will be problematic.
I would really appreciate some more light on this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]