kosiew opened a new pull request, #17637:
URL: https://github.com/apache/datafusion/pull/17637

   
   ## Which issue does this PR close?
   
   Partial fix for #16579
   
   This is part of a series smaller PRs to replace #17281
   
   ## Rationale for this change
   
   The pruning logic previously used `arrow::compute::cast` directly when
   converting statistics arrays into the target field types. This caused
   two problems:
   
   1. When the statistics arrived as binary (e.g. `Binary`,
   `LargeBinary`, or `BinaryView`) but the target column type was UTF-8
   (e.g. `Utf8`, `LargeUtf8`, `Utf8View`), invalid UTF-8 byte sequences
   could produce unchecked/invalid string values or require special
   handling. Using a safe cast path ensures invalid UTF-8 is converted to
   `NULL` rather than producing incorrect values or panics.
   
   2. Struct-typed statistics were not always handled robustly. Nested
   struct casting and validation required a number of defensive checks
   and helpers to ensure the statistics layout matches the requested
   struct layout (including useful errors when incompatible).
   
   This change centralizes casting behaviour using
   `datafusion_common::cast_column`, adds special handling for
   `BinaryView` to sanitize invalid UTF-8 bytes ahead of casting, and
   introduces a set of unit tests that exercise:
   
   * Binary → UTF-8 safe casts (including `BinaryView` special-case
     sanitization)
   * Nested struct statistic handling
   * Several error cases that previously produced unclear failures
   
   These changes make pruning more robust when reading
   metadata/statistics from formats such as Parquet and ensure
   invalid/inconsistent statistics do not silently propagate incorrect
   values.
   
   ## What changes are included in this PR?
   
   * Replace `arrow::compute::cast(&array, data_type)` usages in pruning
     with `cast_column(&array, stat_field, &cast_options)` so that
   DataFusion's centralized casting rules are applied. The cast path
   respects `CastOptions` and enables safe casting where appropriate.
   
   * Use `DEFAULT_CAST_OPTIONS` and enable `safe` casting when the source
     statistics type is binary (`Binary`, `LargeBinary`, `BinaryView`)
   and the requested/target type is a string type (`Utf8`, `LargeUtf8`,
   `Utf8View`). This converts invalid UTF-8 byte sequences to `NULL`
   instead of producing invalid string values.
   
   * Add `sanitize_binary_array_for_utf8` in
     `datafusion/common/src/nested_struct.rs` to pre-process a
   `BinaryViewArray` so invalid UTF-8 sequences are explicitly converted
   to nulls before casting. For other binary array representations the
   existing Arrow safe casts are sufficient and the array is returned
   unchanged.
   
   * Add logic to `datafusion/pruning/src/pruning_predicate.rs` to
     determine when `safe` casting is required, construct `cast_options`,
   and pass them to `cast_column`.
   
   * Add numerous unit tests for pruning:
   
     * `test_build_statistics_large_binary_to_large_utf8_invalid_bytes`
     * `test_build_statistics_binary_view_to_utf8_view_invalid_bytes`
     * `test_build_statistics_binary_view_to_utf8_view_valid_bytes`
     * `test_build_statistics_struct_column`
     * `test_build_statistics_invalid_utf8_safe_cast`
     * `test_build_statistics_nested_invalid_utf8_safe_cast`
     * Struct-related negative tests verifying informative errors for
       incompatible layouts, non-struct sources, and inconsistent
   lengths.
   
   * Minor refactors and imports across `nested_struct.rs` and
     `pruning_predicate.rs` (e.g. `AsArray as _`, `DataType` usage, small
   helpers such as `make_struct_required_cols` in tests).
   
   ## Are these changes tested?
   
   Yes — this PR adds and updates unit tests under `pruning` tests to
   cover both happy-path and error-path behaviours:
   
   * Tests ensure invalid UTF-8 bytes in binary statistics are converted
     to `NULL` when casting to string types, including special handling
   of `BinaryViewArray`.
   * Tests validate nested struct statistics are preserved and cast
     correctly to the requested struct layout, and that incompatible
   layouts produce clear, descriptive errors.
   * Additional tests cover the `mismatched statistics length` error
     scenario.
   
   All tests were added to the existing test modules and exercise the
   `build_statistics_record_batch` logic used by pruning.
   
   ## Are there any user-facing changes?
   
   No direct user-facing API changes are expected. This work affects
   internal pruning/statistics conversion logic and unit tests. The
   change improves robustness of pruning when reading file-format
   statistics and avoids incorrect string values resulting from invalid
   UTF-8 bytes in statistics metadata.
   
   If there are consumers of internal `cast` behaviour in custom code
   paths, they should prefer `datafusion_common::cast_column` for
   consistency with DataFusion's casting semantics.
   
   ---
   
   ### Notes for reviewers
   
   * Look at `datafusion/common/src/nested_struct.rs` for
     `sanitize_binary_array_for_utf8` and the `Utf8View` handling in
   `cast_column` branching.
   * Review `datafusion/pruning/src/pruning_predicate.rs` changes,
     especially the decision logic that toggles `cast_options.safe` when
   binary stats are being cast to string types.
   * The new tests show examples of both valid and invalid inputs for
     `BinaryViewArray` and `BinaryArray` to confirm the intended
   behaviour.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to