arthurpassos commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1229642355
##########
cpp/src/parquet/encoding.cc:
##########
@@ -1854,15 +1920,30 @@ void DictDecoderImpl<ByteArrayType>::InsertDictionary(::arrow::ArrayBuilder* bui
PARQUET_THROW_NOT_OK(binary_builder->InsertMemoValues(*arr));
}
-class DictByteArrayDecoderImpl : public DictDecoderImpl<ByteArrayType>,
-                                 virtual public ByteArrayDecoder {
+template <>
+void DictDecoderImpl<LargeByteArrayType>::InsertDictionary(
+    ::arrow::ArrayBuilder* builder) {
+  auto binary_builder =
+      checked_cast<::arrow::LargeBinaryDictionary32Builder*>(builder);
+
+  // Make a LargeBinaryArray referencing the internal dictionary data
+  auto arr = std::make_shared<::arrow::LargeBinaryArray>(
+      dictionary_length_, byte_array_offsets_, byte_array_data_);
Review Comment:
Hm.. this might actually be a problem, if I understood it correctly.
::arrow::LargeBinaryArray uses ::arrow::LargeBinaryType, which defines
offset_type to be 64 bits. byte_array_offsets_ is just a raw buffer, so I
assume that when the array reads it there will be a "blind" cast that treats
the offsets as 64 bits wide; since we are passing a buffer of 32-bit offsets,
that would misinterpret the data.
Is that correct?
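
For illustration, here is a minimal standalone sketch of the mismatch I mean
(my own example, not code from the PR): it hands a buffer of int32 offsets to
::arrow::LargeBinaryArray and reads them back through raw_value_offsets(),
which is typed const int64_t*.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>

#include "arrow/api.h"

int main() {
  // Two values, "ab" and "cd", described by 32-bit offsets {0, 2, 4}.
  const int32_t offsets32[] = {0, 2, 4};
  const uint8_t data[] = {'a', 'b', 'c', 'd'};

  // Non-owning buffers over the raw bytes above.
  auto offsets_buf = std::make_shared<::arrow::Buffer>(
      reinterpret_cast<const uint8_t*>(offsets32), sizeof(offsets32));
  auto data_buf = std::make_shared<::arrow::Buffer>(data, sizeof(data));

  // LargeBinaryArray has offset_type = int64_t, but nothing checks the
  // element width of the offsets buffer we pass in.
  ::arrow::LargeBinaryArray arr(/*length=*/2, offsets_buf, data_buf);

  // raw_value_offsets() reinterprets the bytes as int64_t, so the first
  // read fuses offsets32[0] and offsets32[1] into one value: on a
  // little-endian machine this prints 8589934592 (0x2'00000000), not 0.
  std::cout << arr.raw_value_offsets()[0] << std::endl;

  // Worse, length 2 implies 3 * 8 = 24 bytes of offsets, but the buffer
  // only holds 12, so later reads would run past the allocation.
  return 0;
}
```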
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.