wgtmac commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1212865493
##########
cpp/src/parquet/properties.h:
##########
@@ -116,6 +116,12 @@ class PARQUET_EXPORT ReaderProperties {
page_checksum_verification_ = check_crc;
}
+ bool use_binary_large_variants() const { return use_binary_large_variants_; }
Review Comment:
It would be moved to `ArrowReaderProperties` to advise the return type of
arrow binary arrays.
To be clear, `ReaderProperties` controls the behavior the internal parquet
reader (without any arrow dependency) while `ArrowReaderProperties` controls
the arrow-related behaviors around parquet.
##########
cpp/src/arrow/array/builder_dict.h:
##########
@@ -724,6 +724,7 @@ using BinaryDictionaryBuilder =
DictionaryBuilder<BinaryType>;
using StringDictionaryBuilder = DictionaryBuilder<StringType>;
using BinaryDictionary32Builder = Dictionary32Builder<BinaryType>;
using StringDictionary32Builder = Dictionary32Builder<StringType>;
+using LargeBinaryDictionary32Builder = Dictionary32Builder<LargeBinaryType>;
Review Comment:
Does `LargeStringDictionary32Builder` worth being added as well?
##########
cpp/src/parquet/types.h:
##########
@@ -64,8 +64,12 @@ struct Type {
DOUBLE = 5,
BYTE_ARRAY = 6,
FIXED_LEN_BYTE_ARRAY = 7,
+
+ // This parquet type does not actually exist (AFAIK) and is used to
+ // create proper type traits
+ LARGE_BYTE_ARRAY = 8,
Review Comment:
I don't think this is right and the parquet internal physical type should be
left unchanged.
We should always derive the target arrow binary type from the arrow
reader/writer properties.
##########
cpp/src/parquet/arrow/schema_internal.cc:
##########
@@ -200,7 +217,7 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(
case ParquetType::DOUBLE:
return ::arrow::float64();
case ParquetType::BYTE_ARRAY:
- return FromByteArray(logical_type);
+ return use_binary_large_variant ? FromLargeByteArray(logical_type) :
FromByteArray(logical_type);
Review Comment:
You may directly pass `use_binary_large_variant` to `FromByteArray` and
avoid duplicate code.
##########
cpp/src/parquet/types.h:
##########
@@ -588,6 +592,26 @@ inline bool operator!=(const ByteArray& left, const
ByteArray& right) {
return !(left == right);
}
+struct LargeByteArray {
+ LargeByteArray() : len(0), ptr(NULLPTR) {}
+ LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
Review Comment:
Ditto.
All the binary values cannot exceed a length of INT32_MAX in the parquet
specs, so reusing `ByteArray` is sufficient.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]