andygrove opened a new pull request, #4283:
URL: https://github.com/apache/datafusion-comet/pull/4283

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   Comet today supports user UDFs only through the JVM `CometUDF` path: a 
Scala/Java callback invoked over JNI for every batch. The user's 
`evaluate(Array[ValueVector])` body either loops in Scala or reaches for Arrow 
Java's compute kernels, both slower than the `arrow-rs` kernels Comet itself 
uses natively.
   
   This PR (experimental, draft) adds a parallel path for **scalar UDFs in 
Rust**. The user implements a small trait against `arrow-rs`, builds their 
crate as a `cdylib`, and registers the resulting `.so` / `.dylib` from Scala. 
Comet loads the library inside the executor and dispatches to it directly 
during native execution, with no JVM round-trip per batch.
   
   The cross-`.so` boundary uses the **Arrow C Data Interface** 
(`FFI_ArrowArray` / `FFI_ArrowSchema`), so user libraries are decoupled from 
Comet's `arrow-rs` and `datafusion` versions: the only stability contract is 
the SDK ABI version (currently `1`).
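   
   A minimal sketch of what gating on the SDK ABI version could look like on the host side, before any descriptor or Arrow data is exchanged. This is an illustration, not the PR's loader: `SDK_ABI_VERSION`, `AbiVersionFn`, `AbiMismatch`, and `check_abi` are hypothetical names, and the two `lib_v*` functions stand in for a symbol that a real loader would resolve from the user's `cdylib`.

```rust
use std::fmt;

/// ABI version the host is built against. The description above states the
/// current value is 1; the constant name is an assumption for this sketch.
pub const SDK_ABI_VERSION: u32 = 1;

/// Hypothetical shape of the versioned entry point a UDF library exports.
pub type AbiVersionFn = unsafe extern "C" fn() -> u32;

/// Returned when the library reports a version the host does not support.
#[derive(Debug, PartialEq, Eq)]
pub struct AbiMismatch {
    pub expected: u32,
    pub found: u32,
}

impl fmt::Display for AbiMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "UDF library ABI version {} does not match host ABI version {}",
            self.found, self.expected
        )
    }
}

/// Call the version entry point and accept only an exact match.
pub fn check_abi(entry: AbiVersionFn) -> Result<(), AbiMismatch> {
    // SAFETY: the entry point takes no arguments and returns a plain integer,
    // so calling it does not depend on any other layout agreement.
    let found = unsafe { entry() };
    if found == SDK_ABI_VERSION {
        Ok(())
    } else {
        Err(AbiMismatch { expected: SDK_ABI_VERSION, found })
    }
}

// Stand-ins for symbols that would normally come from a loaded library.
unsafe extern "C" fn lib_v1() -> u32 {
    1
}

unsafe extern "C" fn lib_v2() -> u32 {
    2
}
```

   Checking the version first means a mismatched library is rejected before the host interprets any of its other exported data.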
   
   ## What changes are included in this PR?
   
   Three new pieces, plus narrow integration in existing Comet:
   
   - **`comet-udf-sdk`** — public Rust crate. Defines `CometScalarUdf`, 
signature / type-tag / error types, an `export!` macro emitting versioned 
`extern "C"` entry points, and an optional `from_scalar_udf_impl` adapter 
behind the `datafusion-adapter` feature.
   - **`comet-test-udfs`** — in-tree test cdylib exposing five UDFs (happy 
path, struct-typed, user error, panic, length mismatch) used by host and 
end-to-end tests.
   - **`rust_udf` module** in `native/core` — `loader` (libloading + ABI check 
+ descriptor parse), a process-wide `cache`, and a `RustUdfAdapter` that 
implements `ScalarUDFImpl`.
   - **`RustUdfCall` proto** in `expr.proto` and a planner branch in 
`create_expr` that resolves the call against the cache and wraps the adapter as 
a `ScalarUDF`.
   - **JNI bridge** (`CometRustUdfBridge` / `comet_rust_udf_bridge.rs`) for 
driver-side `validateLibrary` / `listUdfs`.
   - **Scala API** — `CometRustUDF.register` / `registerAll`, 
`CometRustUdfRegistry`, typed exception classes.
   - **`QueryPlanSerde` branch** that recognizes a `ScalaUDF` whose name is 
registered and emits `RustUdfCall` instead.
   - **User guide** at `docs/source/user-guide/latest/custom-rust-udfs.md`.
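   
   The process-wide `cache` mentioned in the list above can be pictured roughly as follows. This is a sketch under assumptions, not the code in this PR: `LoadedLibrary`, `CACHE`, and `get_or_load` are hypothetical names, the cache is keyed by path purely for illustration, and the `load` closure stands in for the real `libloading` + ABI check + descriptor parse step.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, OnceLock};

/// Placeholder for whatever a real loader keeps alive per library
/// (library handle, parsed descriptors, and so on).
#[derive(Debug)]
pub struct LoadedLibrary {
    pub path: PathBuf,
    pub udf_names: Vec<String>,
}

/// One map per process, created lazily on first use.
static CACHE: OnceLock<Mutex<HashMap<PathBuf, Arc<LoadedLibrary>>>> = OnceLock::new();

/// Return the cached library for `path`, loading it at most once.
/// Failed loads are not cached, so a later call can retry.
pub fn get_or_load<F>(path: &Path, load: F) -> Result<Arc<LoadedLibrary>, String>
where
    F: FnOnce(&Path) -> Result<LoadedLibrary, String>,
{
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    // Holding the lock across `load` serializes loads, which keeps the
    // sketch simple and guarantees a single load per path.
    let mut guard = cache
        .lock()
        .map_err(|_| "UDF library cache lock poisoned".to_string())?;
    if let Some(hit) = guard.get(path) {
        return Ok(Arc::clone(hit));
    }
    let lib = Arc::new(load(path)?);
    guard.insert(path.to_path_buf(), Arc::clone(&lib));
    Ok(lib)
}
```

   Returning an `Arc` gives callers a cheap handle to the same instance, which is the property a "cache identity" test can check with `Arc::ptr_eq`.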
   
   Marked experimental. The scope is intentionally narrow: scalar UDFs only, 
dynamic-library loading only, and no JVM fallback. Library distribution is the 
user's responsibility (Spark `--files` or pre-install). Aggregate, window, and 
table-valued UDFs, along with richer nested-type signature mapping, are 
deliberately deferred.
   
   ## How are these changes tested?
   
   - **SDK unit tests** (`comet-udf-sdk`) — 11 tests covering type-tag 
round-trip, IPC field encoding, error types, layout assertions for both 
`UdfError` and `UdfDescriptor`, the `EncodedSignature` builder, and the 
optional DataFusion adapter (signature derivation, scalar materialization, 
non-Exact rejection).
   - **Native host tests** (`native/core/src/execution/rust_udf/`) — 9 tests 
covering library load + ABI check, descriptor parse for primitive and 
struct-typed UDFs, process-wide cache identity, and four async tokio adapter 
tests that run UDFs end-to-end through DataFusion (happy path, user error, 
panic, length mismatch).
   - **Driver-side Scala suite** (`CometRustUdfRegistrySuite`) — 3 tests 
covering register / re-register / snapshot semantics on the driver registry.
   - **End-to-end Spark suite** (`CometRustUdfSuite`) — 6 tests pass: native 
execution of `add_one`, error / panic surfacing, missing-path failure, 
signature mismatch failure, and `registerAll` for primitive-typed UDFs. One 
test (`registerAll` over the struct-typed fixture) is currently cancelled: it 
hits a v1 limitation in mapping Arrow's `DataType::to_string` output for 
`Struct` to a form the Spark DDL parser accepts. The limitation is documented 
in the user guide's Limitations section; struct-typed UDFs still work via 
explicit `register` with declared types.
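   
   The user error, panic, and length mismatch cases exercised by the tests above correspond to three distinct failure modes of a UDF call. The sketch below shows one way to separate them using only the standard library. It is not the PR's implementation: `UdfCallError` and `call_guarded` are hypothetical names, and a `Vec<i64>` stands in for an Arrow array. Where the PR actually performs these checks (in the SDK's exported entry points or in the host adapter) is not stated in this description.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// The three failure modes, kept distinct so each can be surfaced differently.
#[derive(Debug, PartialEq, Eq)]
pub enum UdfCallError {
    /// The UDF returned an error of its own.
    User(String),
    /// The UDF panicked; the payload is rendered as text when possible.
    Panic(String),
    /// The UDF returned a column whose length differs from the input batch.
    LengthMismatch { expected: usize, actual: usize },
}

/// Run `udf` for a batch of `num_rows` rows, converting panics into errors
/// and rejecting outputs of the wrong length.
pub fn call_guarded<F>(num_rows: usize, udf: F) -> Result<Vec<i64>, UdfCallError>
where
    F: FnOnce() -> Result<Vec<i64>, String>,
{
    // A panic must not unwind across an `extern "C"` boundary, so it is
    // caught here and turned into an ordinary error value.
    let outcome = catch_unwind(AssertUnwindSafe(udf)).map_err(|payload| {
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "non-string panic payload".to_string());
        UdfCallError::Panic(msg)
    })?;

    let out = outcome.map_err(UdfCallError::User)?;

    // A scalar UDF is expected to produce exactly one output row per input row.
    if out.len() != num_rows {
        return Err(UdfCallError::LengthMismatch {
            expected: num_rows,
            actual: out.len(),
        });
    }
    Ok(out)
}
```

   Note that `catch_unwind` only intercepts unwinding panics; a library built with `panic = "abort"` would still terminate the process.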
   
   The end-to-end suite is gated on `-Dcomet.test.udfs.lib=<path to 
libcomet_test_udfs>`; the path is plumbed through scalatest's 
`systemProperties` in the root `pom.xml`. The Rust test crate's `core/build.rs` 
exposes the same path to native tests via the `COMET_TEST_UDFS_LIB` env var.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

