mbutrovich commented on PR #4232: URL: https://github.com/apache/datafusion-comet/pull/4232#issuecomment-4391357561
This PR got me thinking about whether the per-expression `CometUDF` pattern could be generalized, and I prototyped a generic dispatcher on top of this PR's framework. Branch: https://github.com/mbutrovich/datafusion-comet/tree/jvm-udf-generic-dispatcher. The core file is [`CometGenericExpressionUDF.scala`](https://github.com/mbutrovich/datafusion-comet/blob/jvm-udf-generic-dispatcher/common/src/main/scala/org/apache/comet/udf/CometGenericExpressionUDF.scala). ## What it does `CometGenericExpressionUDF` is one `CometUDF` class that evaluates an arbitrary Spark `Expression` tree. At plan time, the serde registers the bound expression in `CometLambdaRegistry` (UUID-keyed), emits a `JvmScalarUdf` proto pointing at the generic class, and passes data attributes as args. At execution time the UDF looks up the expression, compiles it once via `GenerateMutableProjection`, and loops over the input Arrow vectors using a reused `SpecificInternalRow`. ## Benefits One JVM-side class handles any scalar Spark Expression, with no per-expression hand-coded evaluator required. The dispatcher evaluates composed expression trees in one JNI hop. If a child node (e.g. `upper(col)` inside `rlike(upper(col), pattern)`) isn't supported natively, the whole tree still evaluates together without forcing whole-plan fallback. Benchmarks competitively with the hand-coded `RegExpLikeUDF` from #4239. Spark's `MutableProjection` codegen produces the same hot-loop shape (bytes to `UTF8String` to eval to result) that a hand-written loop does, so there is no inherent per-row dispatcher overhead. One Janino compile per expression tree, cached by registry key. ## Limitations ### Near-term, fixable with incremental work - Types are prototype-narrow. Input is `VarCharVector` only, output is `BitVector` only. Widening is mechanical: build an `Array[ColumnReader]` and a `ResultWriter` at cache-miss time, dispatching on Arrow type and `expression.dataType` once per expression. Scaladoc in the prototype file sketches the shape. - `CometLambdaRegistry` is JVM-local. Driver and executor sharing a JVM works for local Spark only. Cluster mode requires serializing the bound `Expression` into the proto (Java serialization or Kryo) and dropping the UUID key. - Only `CometRLike` is wired for the generic path in this prototype. The serde logic is not `RLike`-specific and can be extracted to a single helper that any `CometExpressionSerde` opts into with one line. - `Nondeterministic` expressions (`rand`, `monotonically_increasing_id`) need an `initialize(partitionIndex)` call before the first row. Easy to add at cache-miss time. - Registry entries are never removed, which leaks for long-running drivers. - `VarCharVector.get(i)` copies bytes into a fresh `byte[]` per row. Matches `RegExpLikeUDF`, so the A/B comparison is fair, but both paths would improve with a reusable `NullableVarCharHolder` or `UTF8String.fromAddress`. ### Longer-term, may never fully reach - Aggregates, window functions, and generators do not fit the `CometUdfBridge` "one result vector per input, same length" contract. Each needs its own bridge signature. - Python and Pandas UDFs are reachable in principle (they are `Expression` subclasses). Whether the per-row socket IPC to the Python worker is cheaper than whole-plan fallback would need to be measured. - Performance parity with native Rust on expressions that emit per-row allocations (decimals, arrays, strings out) is unlikely. JVM boxing through `UnsafeRow` and `ArrayData` is inherent to the evaluation shape, whereas native Rust writes directly into Arrow buffers. - Cross-Spark-version stability of Expression serialization is fragile. Spark internals change between releases, and a cluster-mode implementation would need a compatibility story. ## Benchmark numbers Per-row nanoseconds, lower is better. Apple M3 Max, OpenJDK 11.0.30+7-LTS, macOS 26.4.1. Source: `CometRegExpBenchmark` with one extra case added for the generic dispatcher. | Pattern | Spark | Comet (Scan) | Comet (Exec, native Rust) | Comet (Exec, JVM hand-coded) | Comet (Exec, JVM generic) | |---|---|---|---|---|---| | character_class `[0-9]+` | 12561.0 | 10616.9 | 4764.3 | 4377.9 | 4293.4 | | anchored `^[0-9]` | 9077.1 | 8776.9 | 3463.7 | 3487.0 | 3384.8 | | alternation `abc\|def\|ghi` | 12189.4 | 11970.7 | 6837.2 | 6497.1 | 6785.3 | | multi_class `[a-zA-Z][0-9]+` | 9394.9 | 10048.6 | 4272.1 | 4193.9 | 4343.2 | | repetition `(ab){2,}` | 9160.1 | 9146.7 | 4086.7 | 4075.5 | 4125.5 | The generic path tracks the hand-coded path within a few percent across all five patterns. Native Rust is competitive but not dominant on these patterns, likely because the workload favors JIT-warmed backtracking over DFA construction. On adversarial patterns or non-regex expressions with tight Rust kernels, native would be expected to pull further ahead. I think this is a super promising direction to more quickly (and provide 100% compatibility) support UDFs! Thanks @andygrove! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
