paleolimbot commented on code in PR #46532:
URL: https://github.com/apache/arrow/pull/46532#discussion_r2100716423


##########
cpp/src/parquet/properties.h:
##########
@@ -1032,6 +1035,18 @@ class PARQUET_EXPORT ArrowReaderProperties {
     }
   }

+  /// \brief Set the Arrow binary type to read BYTE_ARRAY columns as.
+  ///
+  /// Allowed values are Type::BINARY, Type::LARGE_BINARY and Type::BINARY_VIEW.
+  /// Default is Type::BINARY.
+  ///
+  /// If a serialized Arrow schema is found in the Parquet metadata,
+  /// this setting is ignored and the Arrow schema takes precedence
+  /// (see ArrowWriterProperties::store_schema).
+  void set_binary_type(::arrow::Type::type value) { binary_type_ = value; }

Review Comment:
   Not necessarily for this PR, but I wonder if just specifying a destination schema (i.e., `read_parquet(..., schema = ...)`) would help avoid further proliferation of options related to output data types (or maybe this already exists).

##########
python/pyarrow/tests/parquet/test_data_types.py:
##########
@@ -387,6 +387,25 @@ def test_fixed_size_binary():
     _check_roundtrip(table)


+def test_binary_types():

Review Comment:
   Do you also need a test for `ParquetFile()` to ensure the option is passed there, too?
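
For context on the setter under review, here is a minimal sketch of how a caller might opt into `BINARY_VIEW` through the new option, assuming the usual `FileReaderBuilder` flow; the file name and function name are illustrative:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Read "data.parquet", mapping BYTE_ARRAY columns to arrow::binary_view().
arrow::Status ReadWithBinaryView() {
  ARROW_ASSIGN_OR_RAISE(auto infile,
                        arrow::io::ReadableFile::Open("data.parquet"));

  parquet::ArrowReaderProperties arrow_props;
  // The setter added in this PR; default is Type::BINARY.
  arrow_props.set_binary_type(::arrow::Type::BINARY_VIEW);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  // Per the doc comment above: if the file carries a serialized Arrow schema
  // (written with ArrowWriterProperties::store_schema), that schema takes
  // precedence and this setting is ignored.
  return arrow::Status::OK();
}
```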
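On the destination-schema suggestion: until a `schema = ...` style read option exists, the closest equivalent is a post-read cast. A rough sketch, assuming a compute `Cast` kernel is available for each source/target pair; `CastToSchema` is an illustrative helper, not existing API:

```cpp
#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Hypothetical helper: cast each column of a freshly read table to the types
// of a caller-supplied "destination" schema, approximating what a
// read_parquet(..., schema=...) option would do in one step.
arrow::Result<std::shared_ptr<arrow::Table>> CastToSchema(
    const std::shared_ptr<arrow::Table>& table,
    const std::shared_ptr<arrow::Schema>& schema) {
  std::vector<std::shared_ptr<arrow::ChunkedArray>> columns;
  columns.reserve(table->num_columns());
  for (int i = 0; i < table->num_columns(); ++i) {
    // Cast accepts Datum-wrapped ChunkedArrays, so each column can be
    // converted independently (e.g. binary -> binary_view).
    ARROW_ASSIGN_OR_RAISE(
        arrow::Datum casted,
        arrow::compute::Cast(table->column(i), schema->field(i)->type()));
    columns.push_back(casted.chunked_array());
  }
  return arrow::Table::Make(schema, std::move(columns), table->num_rows());
}
```

A built-in destination-schema option would essentially fold this cast into the read path, which is the option-proliferation-avoiding behavior the first review comment describes.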
########## cpp/src/parquet/properties.h: ########## @@ -1032,6 +1035,18 @@ class PARQUET_EXPORT ArrowReaderProperties { } } + /// \brief Set the Arrow binary type to read BYTE_ARRAY columns as. + /// + /// Allowed values are Type::BINARY, Type::LARGE_BINARY and Type::BINARY_VIEW. + /// Default is Type::BINARY. + /// + /// If a serialized Arrow schema is found in the Parquet metadata, + /// this setting is ignored and the Arrow schema takes precedence + /// (see ArrowWriterProperties::store_schema). + void set_binary_type(::arrow::Type::type value) { binary_type_ = value; } Review Comment: Not necessarily for this PR, but I wonder if just specifying a destination schema (i.e., `read_parquet(..., schema = ...)`) would help avoid further proliferation of options related to output data types (or maybe this already exists). ########## python/pyarrow/tests/parquet/test_data_types.py: ########## @@ -387,6 +387,25 @@ def test_fixed_size_binary(): _check_roundtrip(table) +def test_binary_types(): Review Comment: Do you also need a test for `ParquetFile()` to ensure the option is passed there, too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org