paleolimbot commented on code in PR #46532:
URL: https://github.com/apache/arrow/pull/46532#discussion_r2100716423


##########
cpp/src/parquet/properties.h:
##########
@@ -1032,6 +1035,18 @@ class PARQUET_EXPORT ArrowReaderProperties {
     }
   }

+  /// \brief Set the Arrow binary type to read BYTE_ARRAY columns as.
+  ///
+  /// Allowed values are Type::BINARY, Type::LARGE_BINARY and Type::BINARY_VIEW.
+  /// Default is Type::BINARY.
+  ///
+  /// If a serialized Arrow schema is found in the Parquet metadata,
+  /// this setting is ignored and the Arrow schema takes precedence
+  /// (see ArrowWriterProperties::store_schema).
+  void set_binary_type(::arrow::Type::type value) { binary_type_ = value; }

Review Comment:
   Not necessarily for this PR, but I wonder if just specifying a destination schema (i.e., `read_parquet(..., schema = ...)`) would help avoid further proliferation of options related to output data types (or maybe this already exists).

##########
python/pyarrow/tests/parquet/test_data_types.py:
##########
@@ -387,6 +387,25 @@ def test_fixed_size_binary():
     _check_roundtrip(table)


+def test_binary_types():

Review Comment:
   Do you also need a test for `ParquetFile()` to ensure the option is passed there, too?
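
For context on the setter under review, here is a minimal sketch of how a caller might opt into `BINARY_VIEW` through the new option, assuming the usual `FileReaderBuilder` flow; the file name and function name are illustrative:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Read "data.parquet", mapping BYTE_ARRAY columns to arrow::binary_view().
arrow::Status ReadWithBinaryView() {
  ARROW_ASSIGN_OR_RAISE(auto infile,
                        arrow::io::ReadableFile::Open("data.parquet"));

  parquet::ArrowReaderProperties arrow_props;
  // The setter added in this PR; default is Type::BINARY.
  arrow_props.set_binary_type(::arrow::Type::BINARY_VIEW);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  // Per the doc comment above: if the file carries a serialized Arrow schema
  // (written with ArrowWriterProperties::store_schema), that schema takes
  // precedence and this setting is ignored.
  return arrow::Status::OK();
}
```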
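On the destination-schema suggestion: until a `schema = ...` style read option exists, the closest equivalent is a post-read cast. A rough sketch, assuming a compute `Cast` kernel is available for each source/target pair; `CastToSchema` is an illustrative helper, not existing API:

```cpp
#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Hypothetical helper: cast each column of a freshly read table to the types
// of a caller-supplied "destination" schema, approximating what a
// read_parquet(..., schema=...) option would do in one step.
arrow::Result<std::shared_ptr<arrow::Table>> CastToSchema(
    const std::shared_ptr<arrow::Table>& table,
    const std::shared_ptr<arrow::Schema>& schema) {
  std::vector<std::shared_ptr<arrow::ChunkedArray>> columns;
  columns.reserve(table->num_columns());
  for (int i = 0; i < table->num_columns(); ++i) {
    // Cast accepts Datum-wrapped ChunkedArrays, so each column can be
    // converted independently (e.g. binary -> binary_view).
    ARROW_ASSIGN_OR_RAISE(
        arrow::Datum casted,
        arrow::compute::Cast(table->column(i), schema->field(i)->type()));
    columns.push_back(casted.chunked_array());
  }
  return arrow::Table::Make(schema, std::move(columns), table->num_rows());
}
```

A built-in destination-schema option would essentially fold this cast into the read path, which is the option-proliferation-avoiding behavior the first review comment describes.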
########## cpp/src/parquet/properties.h: ########## @@ -1032,6 +1035,18 @@ class PARQUET_EXPORT ArrowReaderProperties { } } + /// \brief Set the Arrow binary type to read BYTE_ARRAY columns as. + /// + /// Allowed values are Type::BINARY, Type::LARGE_BINARY and Type::BINARY_VIEW. + /// Default is Type::BINARY. + /// + /// If a serialized Arrow schema is found in the Parquet metadata, + /// this setting is ignored and the Arrow schema takes precedence + /// (see ArrowWriterProperties::store_schema). + void set_binary_type(::arrow::Type::type value) { binary_type_ = value; } Review Comment: Not necessarily for this PR, but I wonder if just specifying a destination schema (i.e., `read_parquet(..., schema = ...)`) would help avoid further proliferation of options related to output data types (or maybe this already exists). ########## python/pyarrow/tests/parquet/test_data_types.py: ########## @@ -387,6 +387,25 @@ def test_fixed_size_binary(): _check_roundtrip(table) +def test_binary_types(): Review Comment: Do you also need a test for `ParquetFile()` to ensure the option is passed there, too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org