diegoQuinas opened a new pull request, #21728:
URL: https://github.com/apache/datafusion/pull/21728
## Which issue does this PR close?
- Closes #21496.
## Rationale for this change
`DataFrame::describe()` is the standard way to get a statistical summary of
a DataFrame (count, null_count, mean, std, min, max, median per column). Today
it handles binary-like columns poorly:
- For `Binary`, an exclusion filter in `min`/`max` aggregations caused both
to be reported as `null`, losing useful information for columns that hold
hashes, UUIDs, fingerprints, or other content-addressed identifiers.
- For `LargeBinary`, `BinaryView`, and `FixedSizeBinary`, the filter did not
apply, so `min`/`max` ran successfully but then the display step tried to
`cast(column, Utf8)`, which Arrow correctly rejects, producing an
`ArrowError::CastError` that bubbled up and failed the whole `describe()` call.
The fix in this PR is aligned with what the issue proposes: stop filtering
`Binary` from the aggregations and render binary outputs as lowercase hex
(matching Arrow's default display of binary arrays).
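To make the target format concrete, here is a minimal, standard-library-only sketch of lowercase hex rendering. It is not the code in this PR, which delegates to Arrow's formatter; `to_lower_hex` is a hypothetical helper used only to illustrate the output shape.

```rust
use std::fmt::Write;

/// Hypothetical helper: render a byte slice as lowercase hex,
/// two digits per byte, with no prefix or separators.
fn to_lower_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // Writing into a `String` cannot fail, so unwrapping is safe here.
        write!(out, "{:02x}", b).unwrap();
    }
    out
}
```

Under this sketch, the bytes `[0x00, 0x01]` render as `0001` and `[0xff, 0xee]` as `ffee`.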
## What changes are included in this PR?
- `datafusion/core/src/dataframe/mod.rs`:
  - Drop `DataType::Binary` from the `min`/`max` exclusion filter. Only
    `Boolean` is still excluded; that exclusion remains meaningful for a
    statistical summary.
  - Add a dedicated display branch for `Binary`, `LargeBinary`,
    `BinaryView`, and `FixedSizeBinary` that uses
    `arrow::util::display::ArrayFormatter` with default options, which renders
    bytes as lowercase hex.
  - Tidy a now-stale comment that referenced the previous binary filter.
  - Drive-by: use the newly imported `FormatOptions` unqualified in
    `DataFrame::to_string()` for consistency.
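The filter change can be pictured with a simplified model. This is a sketch under assumptions, not the actual DataFusion code: `ColumnType` is a hypothetical stand-in for the relevant `DataType` variants, and both predicates are reconstructed from the description above.

```rust
/// Hypothetical stand-in for the Arrow data types discussed in this PR.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Boolean,
    Binary,
    LargeBinary,
    BinaryView,
    FixedSizeBinary,
    Int64,
}

/// Before the change (as described): `Boolean` and `Binary` were skipped.
fn min_max_applies_before(t: ColumnType) -> bool {
    !matches!(t, ColumnType::Boolean | ColumnType::Binary)
}

/// After the change: only `Boolean` is skipped.
fn min_max_applies_after(t: ColumnType) -> bool {
    !matches!(t, ColumnType::Boolean)
}
```

In this model `LargeBinary`, `BinaryView`, and `FixedSizeBinary` pass both predicates, which is why they previously reached the display step and hit the cast error, while `Binary` never got that far.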
## Are these changes tested?
Yes. A new integration test `describe_binary_columns` in
`datafusion/core/tests/dataframe/describe.rs` builds an in-memory `RecordBatch`
with one column per binary-like type and asserts the full `describe()` output
via an inline `insta` snapshot. The test covers non-null values and a null row
per column, so it exercises both `null_count` and the hex rendering path for
`min`/`max`.
All existing `describe` tests continue to pass unchanged.
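As a rough illustration of what the test exercises, the sketch below computes `null_count`, `min`, and `max` for a nullable column of byte strings. It assumes bytewise lexicographic ordering and that nulls are ignored by `min`/`max`; `BinarySummary` and `summarize` are hypothetical names, and the code is a standard-library model rather than the Arrow or DataFusion aggregation kernels.

```rust
/// Hypothetical summary of one nullable binary column.
#[derive(Debug, PartialEq)]
struct BinarySummary<'a> {
    null_count: usize,
    min: Option<&'a [u8]>,
    max: Option<&'a [u8]>,
}

/// Count nulls and take the bytewise min/max of the non-null values.
fn summarize<'a>(column: &[Option<&'a [u8]>]) -> BinarySummary<'a> {
    let null_count = column.iter().filter(|v| v.is_none()).count();
    // `flatten` drops the `None` entries; byte slices compare lexicographically.
    let min = column.iter().copied().flatten().min();
    let max = column.iter().copied().flatten().max();
    BinarySummary { null_count, min, max }
}
```

For a column holding `[0xff, 0xee]`, a null, and `[0x00, 0x01]`, this model reports one null, a minimum of `[0x00, 0x01]`, and a maximum of `[0xff, 0xee]`. A column with no non-null values yields `None` for both.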
## Are there any user-facing changes?
Yes. This is a visible behavior change for `DataFrame::describe()`:
- Before: `min`/`max` on `Binary` columns were `null`; other binary-like
types caused a cast error.
- After: `min`/`max` on all binary-like types render as lowercase hex
strings (e.g. `"0001"`, `"ffee"`).
No public API changes.