xushiyan opened a new issue, #609:
URL: https://github.com/apache/hudi-rs/issues/609
## Summary
`hudi_core::expr::filter::Filter` represents filter values as `Vec<String>`:
```rust
pub struct Filter {
field: String,
operator: ExprOperator,
values: Vec<String>,
}
```
This forces every typed value (Arrow `ScalarValue`, DataFusion literal,
decimal, timestamp, binary, …) to be stringified at the boundary and parsed
back to a typed `Scalar<ArrayRef>` at apply time via
`arrow_cast::cast_with_options` (`crates/core/src/expr/filter.rs:345`). The
round trip is lossy: any disagreement between the producer's `Display` and
`arrow_cast`'s parser becomes a silent correctness bug — the filter applies but
never matches, returning zero rows.
This proposal: make `Filter::values` typed internally while preserving the
string-tuple API for cross-language callers (Python, SQL `CREATE EXTERNAL
TABLE` options).
## Current state
`Filter` is constructed three ways, all string-flavored:
- `Filter::new(field: String, op: ExprOperator, values: Vec<String>)` —
`crates/core/src/expr/filter.rs:38`
- Builder via `FilterField::eq/ne/lt/...(value: impl Into<String>)` —
`filter.rs:237-285`
- Tuple parser `from_str_tuples` for cross-language entry points —
`filter.rs:156`
At apply time, values are cast back into the column's Arrow `DataType` for
comparison against record batches. The DataFusion integration goes through this
round trip on every pushed-down filter
(`crates/datafusion/src/util/expr.rs:108-220`).
## Bugs this has already produced + lurking
| Type | DataFusion `Display` | `arrow_cast` expects | Status |
|---|---|---|---|
| `Decimal128/256` | unscaled integer (`\"7500\"`) | scaled decimal
(`\"75.00\"`) | **Was broken**; patched in `crates/datafusion/src/util/expr.rs`
(`scalar_to_filter_value`) — required ~50 lines of formatting per direction |
| `Timestamp{Nanosecond,Microsecond,Millisecond,Second}` |
unit/timezone-dependent | parseable form | Likely lurking; no round-trip tests |
| `Binary` / `FixedSizeBinary` | debug-style `[1, 2, 3]` | hex / base64
(unclear) | Almost certainly broken |
| `Float NaN/Inf` | `\"NaN\"`, `\"inf\"` | parses but `NaN = NaN` is false
in SQL | Semantic mismatch |
| `Null` decimals | new helper emits `\"NULL\"` | error or sentinel? |
Undocumented |
| `List` / `Struct` / `Map` | not representable | not representable | Silent
drop |
The decimal patch fixed one corner. The same bug class will recur for
timestamps next time someone pushes down a timestamp predicate.
## Root cause
`Filter` now serves two jobs through one string-typed shape:
1. **Partition pruning** — partition values come from directory names, so
string comparison is natural.
2. **Row-level filtering** — typed comparison against Arrow arrays via
`cast_value`. The string trip is a leaky abstraction.
The collision of these two jobs is the design issue.
## Proposed shape
Introduce an internal `FilterValue` and keep the string-tuple API as a thin
convenience layer.
```rust
#[derive(Clone, Debug, PartialEq)]
pub enum FilterValue {
/// Owned typed scalar — preferred construction path from typed producers
/// (DataFusion, Arrow, internal pruners).
Typed(ScalarValue),
/// Untyped string — for cross-language and SQL option callers. Resolved
/// to a typed scalar at apply time against the column's DataType.
Untyped(String),
}
pub struct Filter {
field: String,
operator: ExprOperator,
values: Vec<FilterValue>,
}
impl Filter {
pub fn new_typed(field: impl Into<String>, op: ExprOperator, values:
Vec<ScalarValue>) -> Result<Self>;
pub fn new(field: String, op: ExprOperator, values: Vec<String>) ->
Result<Self>; // existing — wraps in Untyped
}
```
At apply time (`filter.rs:333+`):
- `FilterValue::Typed(scalar)` → use directly (no parsing, no precision
loss).
- `FilterValue::Untyped(s)` → existing `cast_value(s, data_type)` path.
The DataFusion integration becomes:
```rust
// crates/datafusion/src/util/expr.rs
let value = FilterValue::Typed(literal.clone()); // was: lit.to_string()
```
The new ~50-line decimal formatter (`scalar_to_filter_value`) and similar
workarounds for timestamps/binary can be deleted.
The Python tuple form `(\"col\", \"=\", \"value\")` and SQL `OPTIONS
(filter='city=tokyo')` continue working — they route through
`FilterValue::Untyped` and the existing `cast_value` path.
## Migration plan
1. Add `FilterValue` enum; change `Filter::values: Vec<FilterValue>`.
2. Keep existing `Filter::new` / `FilterField::eq/ne/...` signatures; have
them wrap in `Untyped` internally.
3. Add `Filter::new_typed` / typed builder methods.
4. Update `filter.rs:333+` apply path to dispatch on `FilterValue` variant.
5. Migrate `crates/datafusion/src/util/expr.rs` to produce `Typed` values;
delete `scalar_to_filter_value` and the decimal formatter.
6. Add round-trip property tests for every supported Arrow `DataType`:
`value → FilterValue::Untyped → cast_value → Arrow Scalar` must equal `value →
FilterValue::Typed → Arrow Scalar`.
7. Audit Python (`crates/pyo3`) and C++ (`cxx`) bindings — string-tuple API
stays, no binding changes needed.
## Out of scope
- **Option C — bypass `Filter` entirely for DataFusion row-level filters**
(use `PhysicalExpr` directly, as `ParquetSource::with_predicate` does). Could
be done independently in `crates/datafusion`; complementary to this proposal
but does not require it.
- **Option A — round-trip property tests as the only fix.** Tightens the
string contract without changing the API. Worth doing regardless; will land in
step 6 of this plan.
- Changes to partition pruning's value-vs-directory-token comparison (still
works on strings; that's the natural form).
## Use cases
- DataFusion pushdown of typed predicates (decimal, timestamp, date, binary)
without per-type stringification workarounds.
- Internal callers building filters from already-typed data (e.g. MDT
pruners) skip the string trip entirely.
- Future Arrow types added upstream don't immediately become correctness
bugs.
## Notes
- Deferred — tracking only. Decimal corner is patched; current code path is
functional.
- Cross-binding impact: Python and C++ APIs use string tuples and are
unaffected.
- Naming: `FilterValue` overlaps with DataFusion's `ScalarValue`
semantically; deliberate to keep the variant `Typed(ScalarValue)` direct.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]