Re: [PR] feat: custom Rust UDFs [experimental] [skip ci] [datafusion-comet]

via GitHub Wed, 27 May 2026 04:57:13 -0700


timsaucer commented on PR #4283:
URL: 
https://github.com/apache/datafusion-comet/pull/4283#issuecomment-4554266702


   Ok, I asked my agent to help me with a pro/con list of switching to use 
`datafusion-ffi` crate for the scalar UDFs. Here is their review. For me the 
big question is if the versioning requirement would be a deal breaker. Right 
now we have some instability in `datafusion-ffi` which leads to requiring 
datafusion major versions to match. This has been an annoyance, but not a deal 
breaker for `datafusion-python`. The major advantage is that it will make 
adding the aggregates and windows trivial once you have the scalars UDFs.
   
   
   
   I read through 
`native/comet-udf-sdk/src/{lib,types,error,macros,adapter}.rs` and the 
`datafusion-ffi` udf module. Short answer: yes, this can be reworked to lean on 
`datafusion-ffi`, and a large fraction of the SDK becomes deletable. The harder 
question is whether you should — and the answer there is mostly yes, with one 
real tradeoff worth surfacing in review.
   
   ## What the rework looks like concretely
   
   The Comet SDK has five concerns, and four of them are already in 
`datafusion-ffi`:
   
   | Comet SDK piece | datafusion-ffi equivalent |
   |---|---|
   | `CometScalarUdf` trait (`name`/`signature`/`invoke`) | `ScalarUDFImpl` 
directly |
   | `UdfDescriptor` `#[repr(C)]` + `offset_of` asserts | `FFI_ScalarUDF` with 
stabby-checked layout |
   | `ArrowTypeTag` + IPC FlatBuffer bytes for non-primitives | `WrappedSchema` 
(wraps `FFI_ArrowSchema` for everything, primitive or not) |
   | `UdfError` (`#[repr(C)]` + code + raw C-string) | `FFI_Result<...>` + 
DataFusion's `Result` round-tripped through stabby |
   | `from_scalar_udf_impl` adapter (feature-gated) | this *is* the FFI; no 
adapter needed |
   | Custom `Volatility` enum | `FFI_Volatility` |
   | `EncodedSignature`, manual descriptor lifetime mgmt | stabby's 
`SString`/`SVec` handle this |
   
   What stays in Comet is roughly:
   
   1. **A very thin discovery ABI** — something like `extern "C" fn 
comet_udf_list_v1(out: *mut SVec<FFI_ScalarUDF>) -> i32`. You still need this 
because `datafusion-ffi` describes a *single* UDF; it doesn't define the "open 
this .so, ask it what UDFs it has" entry point. That's the only piece Comet 
legitimately needs to own. It's maybe 30-50 lines plus an `export!` macro.
   2. **The host-side loader** (`loader.rs`, `cache.rs`) — basically unchanged, 
but `RustUdfHandle` holds an `Arc<ScalarUDF>` built from a `ForeignScalarUDF` 
instead of raw `InvokeFn`/`FreeErrFn` pointers.
   3. **The planner branch and JNI bridge** — unchanged in shape; the JNI 
bridge actually gets simpler because you can `clone()` an `FFI_ScalarUDF` on 
the driver and ask it directly for name/signature/return type instead of 
round-tripping through hand-coded JSON.
   4. **`RustUdfAdapter`** — deletable. `ForeignScalarUDF` already implements 
`ScalarUDFImpl`, so it goes straight into the existing 
`ScalarUDF::new_from_impl(...)` path in `PhysicalPlanner`.
   
   That's roughly ~1000 of the ~1300 lines under `native/comet-udf-sdk/` and 
`native/core/src/execution/rust_udf/` either deleted or significantly 
simplified.
   
   ## Benefits
   
   The big functional wins come from inheriting things the PR explicitly punts 
on. The user guide and `adapter.rs` both call out "TypeSignature::Exact only" — 
that's because deriving an Arrow type list at describe-time is the only thing 
the custom `ArrowTypeTag` scheme can express. With `FFI_ScalarUDF`, 
`coerce_types` and `return_field_from_args` are part of the surface, so 
variadic signatures, type coercion, and metadata/nullability-dependent return 
types all work without redesign. Same goes for aliases (Comet has none today), 
`short_circuits`, and full `ConfigOptions` propagation. The struct-type DDL 
parsing limitation called out as a `registerAll` test cancellation goes away 
too — `FFI_ArrowSchema` round-trips arbitrary Arrow types including nested 
structs without going through Spark's DDL parser.
   
   The other big win is that aggregate and window UDFs (deferred in the PR) 
become a straight extension: `FFI_AggregateUDF` and `FFI_WindowUDF` already 
exist with the same shape, so the discovery ABI just grows from 
`comet_udf_list_v1` to a small enum of UDF kinds, instead of needing a parallel 
`comet_aggregate_udf_*` ABI later.
   
   From a correctness angle, stabby handles a few sharp edges in the current 
code:
   - The `UdfError::set` / `free_in_place` design correctly forces the host to 
call `comet_udf_free_error` in the same cdylib (you documented this), but it's 
load-bearing — if a future contributor cross-cdylib calls `CString::from_raw` 
on a pointer from a different allocator, it's UB. stabby's `SString` is 
allocator-agnostic.
   - The `#[cfg(target_pointer_width = "64")]` gating on layout asserts means 
32-bit is implicitly unsupported — fine, but stabby makes that explicit and 
refuses to compile on layouts it can't guarantee.
   - The hand-rolled `EncodedSignature` pinning into a `OnceLock` to keep raw 
pointers stable for the process lifetime is replaced by stabby-owned 
`SVec`/`SString` whose lifetimes are tied to the `FFI_ScalarUDF` value.
   
   ## Drawbacks — the honest list
   
   **Arrow-version decoupling is the one claim that genuinely weakens.** The PR 
states user libraries are "decoupled from Comet's `arrow-rs` and `datafusion` 
versions: the only stability contract is the SDK ABI version." That claim is 
already softer than it sounds — `lib.rs`'s docs note "You also need a direct 
dependency on `arrow` (matching the version used by Comet's host)." So the 
decoupling is at the Arrow *C Data Interface* level (the data layout), not the 
Rust *type* level. The user already needs a matching `arrow` crate.
   
   Moving to `datafusion-ffi` adds `datafusion-ffi` itself (plus transitively 
`datafusion-common` and `datafusion-expr`) to the user's `Cargo.toml`. The 
whole point of `datafusion-ffi` is to be ABI-stable across DataFusion versions, 
but the user does pull more code in. Realistically that's an extra 10-30 MB of 
stripped cdylib and a chunk of build time. For Spark `--files` distribution 
this isn't great but isn't a dealbreaker either.
   
   **User-facing API surface is larger.** `CometScalarUdf` is three methods. 
`ScalarUDFImpl` is ~ten plus a `Signature`. For someone writing "downcast to 
`Int64Array`, return `Int64Array`," the current SDK is friendlier. Users who 
already know DataFusion get reuse, and the example in the user guide barely 
changes; users who don't get a steeper ramp. You could ease this with a thin 
re-export / type alias in `comet-udf-sdk` so the public surface still looks 
like "Comet's SDK" while the trait underneath is `ScalarUDFImpl`.
   
   **Velocity coupling to datafusion-ffi.** If you hit a bug in stabby's 
behavior or in an `FFI_ScalarUDF` field, you're now waiting on a datafusion 
release (or vendoring). The flip side is the bugs being fixed are not Comet's 
to fix.
   
   **Panic semantics.** The Comet SDK has very deliberate `catch_unwind` 
placement in `invoke_impl` with a distinct `UdfErrorCode::Panic` so the host 
can attribute panics versus user errors. `datafusion-ffi` catches panics inside 
its FFI wrappers and surfaces them as `FFI_Result::Err`, but the user-error / 
panic distinction is collapsed — both become `DataFusionError`. If the 
Spark-facing exception classification (`CometRustUdfPanicException` etc.) is 
meant to stay distinct, you'd want to verify that distinction survives or layer 
it back on (e.g. checking message prefix, or adding a panic flag to 
`FFI_Result` upstream).
   
   **Loss of "Comet owns the wire format."** Today the wire format is in 
`comet-udf-sdk` and Comet controls when ABI v2 happens. Switch to 
`datafusion-ffi` and the wire format is `FFI_ScalarUDF`'s layout, which is 
owned upstream and stabilized through stabby. That's *fine* — but it does mean 
a layout change in `datafusion-ffi` forces a Comet bump, and 
`COMET_UDF_ABI_VERSION` either goes away or becomes a (Comet discovery version, 
datafusion-ffi version) tuple.
   
   ## Recommendation
   
   Two things to verify before committing to the direction:
   - Whether `ForeignScalarUDF` preserves the user-error vs panic distinction 
the way Comet's typed exceptions need it to.
   - Whether the discovery ABI (`comet_udf_list_v1`) can return 
`SVec<FFI_ScalarUDF>` cleanly, or whether you want a small wrapper enum to 
leave room for `FFI_AggregateUDF` / `FFI_WindowUDF` in the same discovery call 
without breaking the discovery ABI later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: custom Rust UDFs [experimental] [skip ci] [datafusion-comet]

Reply via email to