mapleFU commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1231159943
##########
cpp/src/parquet/encoding.cc:
##########
@@ -1854,15 +1920,30 @@ void DictDecoderImpl<ByteArrayType>::InsertDictionary(::arrow::ArrayBuilder* builder) {
   PARQUET_THROW_NOT_OK(binary_builder->InsertMemoValues(*arr));
 }
-class DictByteArrayDecoderImpl : public DictDecoderImpl<ByteArrayType>,
-                                 virtual public ByteArrayDecoder {
+template <>
+void DictDecoderImpl<LargeByteArrayType>::InsertDictionary(
+    ::arrow::ArrayBuilder* builder) {
+  auto binary_builder =
+      checked_cast<::arrow::LargeBinaryDictionary32Builder*>(builder);
+
+  // Make a LargeBinaryArray referencing the internal dictionary data
+  auto arr = std::make_shared<::arrow::LargeBinaryArray>(
+      dictionary_length_, byte_array_offsets_, byte_array_data_);
Review Comment:
I'm not sure I understand what you mean.

> we wouldn't need to modify the offset_type during "calculation"

Maybe. I guess the actual code would be better as something like:
1.
```c++
offset_t offset = 0;
uint8_t* bytes_data = byte_array_data_->mutable_data();
offset_t* bytes_offsets =
    reinterpret_cast<offset_t*>(byte_array_offsets_->mutable_data());
```
In this case, `byte_array_offsets_` is allocated with length `(size + 1) * sizeof(offset_t)` and written using `offset_t`. The underlying Parquet data is int32_t, so the write would be something like:
```
// bytes_offsets[idx] could be an int64_t or an int32_t;
// len should be int32_t.
bytes_offsets[idx] = len + bytes_array[idx].len;
```
And when building, just use `byte_array_offsets_` directly (see the sketch after this list).
2.
Don't change `byte_array_offsets_`; when building, cast it to an int64_t array with length `(size + 1) * sizeof(offset_t)`.
Personally I prefer (1).
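
To make the two options concrete, here's a minimal sketch (not the actual patch; `dict_values`, `num_entries`, and `large_offsets` are hypothetical stand-ins for the decoder's dictionary entries and a freshly allocated int64_t offsets buffer):

```c++
// Option (1): byte_array_offsets_ is allocated as
// (size + 1) * sizeof(offset_t) and offsets are written in offset_t
// (int32_t for ByteArrayType, int64_t for LargeByteArrayType).
uint8_t* bytes_data = byte_array_data_->mutable_data();
auto* bytes_offsets =
    reinterpret_cast<offset_t*>(byte_array_offsets_->mutable_data());
offset_t offset = 0;
for (int i = 0; i < num_entries; ++i) {
  bytes_offsets[i] = offset;
  std::memcpy(bytes_data + offset, dict_values[i].ptr, dict_values[i].len);
  offset += dict_values[i].len;  // each len is 32-bit
}
bytes_offsets[num_entries] = offset;  // closing offset

// Option (2): keep writing int32_t offsets, and only upcast at build
// time into a separate (size + 1) * sizeof(int64_t) buffer.
const auto* offsets32 =
    reinterpret_cast<const int32_t*>(byte_array_offsets_->data());
auto* offsets64 =
    reinterpret_cast<int64_t*>(large_offsets->mutable_data());
std::copy(offsets32, offsets32 + num_entries + 1, offsets64);  // widens
```

With (1), the build step can hand `byte_array_offsets_` to the `LargeBinaryArray` directly, with no extra copy at build time.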