arthurpassos commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1231127814
##########
cpp/src/parquet/encoding.cc:
##########
@@ -1854,15 +1920,30 @@ void DictDecoderImpl<ByteArrayType>::InsertDictionary(::arrow::ArrayBuilder* bui
PARQUET_THROW_NOT_OK(binary_builder->InsertMemoValues(*arr));
}
-class DictByteArrayDecoderImpl : public DictDecoderImpl<ByteArrayType>,
- virtual public ByteArrayDecoder {
+template <>
+void DictDecoderImpl<LargeByteArrayType>::InsertDictionary(
+ ::arrow::ArrayBuilder* builder) {
+  auto binary_builder = checked_cast<::arrow::LargeBinaryDictionary32Builder*>(builder);
+
+ // Make a LargeBinaryArray referencing the internal dictionary data
+ auto arr = std::make_shared<::arrow::LargeBinaryArray>(
+ dictionary_length_, byte_array_offsets_, byte_array_data_);
Review Comment:
hm, that's true, my bad for confusing `byte_array_data_` with
`byte_array_offsets_`.
In that case, it looks like defining an `offset_type` trait might do the
trick: we wouldn't need to change the offset type during the "calculation",
which can stay `int32_t` since we are not expecting to exceed that range.
By "offset calculation" I am referring to the code below:
```cpp
int32_t offset = 0;
uint8_t* bytes_data = byte_array_data_->mutable_data();
int32_t* bytes_offsets =
reinterpret_cast<int32_t*>(byte_array_offsets_->mutable_data());
```
I believe we are only interested in changing the storage type/size, and in
`bytes_offsets[i] = offset` the implicit cast will do the trick.
Please correct me if I am wrong.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]