kosiew opened a new issue, #22687:
URL: https://github.com/apache/datafusion/issues/22687
## Summary
Scalar distance currently returns Option<usize>, but interval cardinality
semantics are defined in terms of u64 domain counts. This introduces
architecture-coupled behavior for large numeric ranges and requires downstream
casting when a u64 semantic result is expected.
## Problem Statement
The distance API in
[datafusion/common/src/scalar/mod.rs](datafusion/common/src/scalar/mod.rs#L256)
uses usize as its numeric result type. usize is machine-width dependent and
represents memory/index sizing, not value-domain cardinality.
A key consumer,
[datafusion/expr-common/src/interval_arithmetic.rs](datafusion/expr-common/src/interval_arithmetic.rs#L916),
computes interval cardinality by converting distance to u64 and then applying
inclusive arithmetic. This cross-type flow can obscure boundary behavior and
makes reasoning about full-domain ranges harder than necessary.
## Why This Matters
- Correctness clarity: Cardinality is a value-domain concept and should be
expressed in a stable domain-sized type.
- Platform consistency: usize-based overflow/None behavior can differ across
targets.
- Maintainability: Callers that need cardinality semantics should not need
ad hoc conversions from usize.
- Boundary safety: Full-range integer and temporal cases are easier to
reason about when distance and cardinality share an explicit overflow-aware
contract.
## Current Behavior
- Scalar distance returns Option<usize> for integer, temporal, and
float-derived distances at
[datafusion/common/src/scalar/mod.rs](datafusion/common/src/scalar/mod.rs#L2566).
- Interval cardinality maps distance to u64 and then uses checked_add(1) in
[datafusion/expr-common/src/interval_arithmetic.rs](datafusion/expr-common/src/interval_arithmetic.rs#L908).
- Stats overlap estimation also uses distance directly as usize at
[datafusion/common/src/stats.rs](datafusion/common/src/stats.rs#L826).
## Desired Behavior
Distance calculations used for value-domain semantics should expose an
overflow-aware u64-oriented contract (or a dedicated domain type), so
large-range behavior is explicit and target-independent.
## Scope
In scope for this issue:
- Introduce either:
- A revised scalar distance API that returns Option<u64>, or
- A dedicated helper for domain cardinality distance (for example, for
interval/statistics semantics) that returns Option<u64>.
- Update key consumers (interval cardinality and stats overlap paths) to use
the new contract.
- Preserve existing user-visible behavior where representable, and make
overflow-to-None rules explicit.
## Proposed Approach
1. Add a domain-oriented distance function in
[datafusion/common/src/scalar/mod.rs](datafusion/common/src/scalar/mod.rs).
2. Keep existing usize API temporarily (if needed) as a compatibility shim
and migrate internal callers.
3. Update interval cardinality implementation in
[datafusion/expr-common/src/interval_arithmetic.rs](datafusion/expr-common/src/interval_arithmetic.rs).
4. Evaluate and update stats overlap estimation in
[datafusion/common/src/stats.rs](datafusion/common/src/stats.rs).
5. Add boundary tests for full-domain integer/temporal ranges and overflow
edges.
6. Remove or deprecate the usize-based path after call-site migration (as
appropriate for semver policy).
## Acceptance Criteria
- Distance semantics for cardinality-sensitive paths are represented with an
overflow-aware u64 contract.
- No behavior regression for currently representable ranges.
- Explicit tests cover:
- Full signed range edge.
- Full unsigned range edge.
- Large temporal range edge.
- Overflow-to-None behavior.
- Interval cardinality path no longer depends on usize-to-u64 conversion for
domain semantics.
- Stats overlap path documents and tests its intended behavior under large
ranges.
## Test Plan
- Unit tests in scalar distance module for boundary values.
- Unit tests in interval arithmetic for cardinality boundaries.
- Unit tests in stats overlap estimation for large range handling.
- Run targeted suites:
- cargo test -p datafusion-common
- cargo test -p datafusion-expr-common
## Related PR
#22309
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]