kosiew opened a new pull request, #17637:
URL: https://github.com/apache/datafusion/pull/17637
## Which issue does this PR close?
Partial fix for #16579
This is part of a series smaller PRs to replace #17281
## Rationale for this change
The pruning logic previously used `arrow::compute::cast` directly when
converting statistics arrays into the target field types. This caused
two problems:
1. When the statistics arrived as binary (e.g. `Binary`,
`LargeBinary`, or `BinaryView`) but the target column type was UTF-8
(e.g. `Utf8`, `LargeUtf8`, `Utf8View`), invalid UTF-8 byte sequences
could produce unchecked/invalid string values or require special
handling. Using a safe cast path ensures invalid UTF-8 is converted to
`NULL` rather than producing incorrect values or panics.
2. Struct-typed statistics were not always handled robustly. Nested
struct casting and validation required a number of defensive checks
and helpers to ensure the statistics layout matches the requested
struct layout (including useful errors when incompatible).
This change centralizes casting behaviour using
`datafusion_common::cast_column`, adds special handling for
`BinaryView` to sanitize invalid UTF-8 bytes ahead of casting, and
introduces a set of unit tests that exercise:
* Binary → UTF-8 safe casts (including `BinaryView` special-case
sanitization)
* Nested struct statistic handling
* Several error cases that previously produced unclear failures
These changes make pruning more robust when reading
metadata/statistics from formats such as Parquet and ensure
invalid/inconsistent statistics do not silently propagate incorrect
values.
## What changes are included in this PR?
* Replace `arrow::compute::cast(&array, data_type)` usages in pruning
with `cast_column(&array, stat_field, &cast_options)` so that
DataFusion's centralized casting rules are applied. The cast path
respects `CastOptions` and enables safe casting where appropriate.
* Use `DEFAULT_CAST_OPTIONS` and enable `safe` casting when the source
statistics type is binary (`Binary`, `LargeBinary`, `BinaryView`)
and the requested/target type is a string type (`Utf8`, `LargeUtf8`,
`Utf8View`). This converts invalid UTF-8 byte sequences to `NULL`
instead of producing invalid string values.
* Add `sanitize_binary_array_for_utf8` in
`datafusion/common/src/nested_struct.rs` to pre-process a
`BinaryViewArray` so invalid UTF-8 sequences are explicitly converted
to nulls before casting. For other binary array representations the
existing Arrow safe casts are sufficient and the array is returned
unchanged.
* Add logic to `datafusion/pruning/src/pruning_predicate.rs` to
determine when `safe` casting is required, construct `cast_options`,
and pass them to `cast_column`.
* Add numerous unit tests for pruning:
* `test_build_statistics_large_binary_to_large_utf8_invalid_bytes`
* `test_build_statistics_binary_view_to_utf8_view_invalid_bytes`
* `test_build_statistics_binary_view_to_utf8_view_valid_bytes`
* `test_build_statistics_struct_column`
* `test_build_statistics_invalid_utf8_safe_cast`
* `test_build_statistics_nested_invalid_utf8_safe_cast`
* Struct-related negative tests verifying informative errors for
incompatible layouts, non-struct sources, and inconsistent
lengths.
* Minor refactors and imports across `nested_struct.rs` and
`pruning_predicate.rs` (e.g. `AsArray as _`, `DataType` usage, small
helpers such as `make_struct_required_cols` in tests).
## Are these changes tested?
Yes — this PR adds and updates unit tests under `pruning` tests to
cover both happy-path and error-path behaviours:
* Tests ensure invalid UTF-8 bytes in binary statistics are converted
to `NULL` when casting to string types, including special handling
of `BinaryViewArray`.
* Tests validate nested struct statistics are preserved and cast
correctly to the requested struct layout, and that incompatible
layouts produce clear, descriptive errors.
* Additional tests cover the `mismatched statistics length` error
scenario.
All tests were added to the existing test modules and exercise the
`build_statistics_record_batch` logic used by pruning.
## Are there any user-facing changes?
No direct user-facing API changes are expected. This work affects
internal pruning/statistics conversion logic and unit tests. The
change improves robustness of pruning when reading file-format
statistics and avoids incorrect string values resulting from invalid
UTF-8 bytes in statistics metadata.
If there are consumers of internal `cast` behaviour in custom code
paths, they should prefer `datafusion_common::cast_column` for
consistency with DataFusion's casting semantics.
---
### Notes for reviewers
* Look at `datafusion/common/src/nested_struct.rs` for
`sanitize_binary_array_for_utf8` and the `Utf8View` handling in
`cast_column` branching.
* Review `datafusion/pruning/src/pruning_predicate.rs` changes,
especially the decision logic that toggles `cast_options.safe` when
binary stats are being cast to string types.
* The new tests show examples of both valid and invalid inputs for
`BinaryViewArray` and `BinaryArray` to confirm the intended
behaviour.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]