mbutrovich commented on PR #4232:
URL: 
https://github.com/apache/datafusion-comet/pull/4232#issuecomment-4391357561

   This PR got me thinking about whether the per-expression `CometUDF` pattern 
could be generalized, and I prototyped a generic dispatcher on top of this PR's 
framework. Branch: 
https://github.com/mbutrovich/datafusion-comet/tree/jvm-udf-generic-dispatcher. 
The core file is 
[`CometGenericExpressionUDF.scala`](https://github.com/mbutrovich/datafusion-comet/blob/jvm-udf-generic-dispatcher/common/src/main/scala/org/apache/comet/udf/CometGenericExpressionUDF.scala).
   
   ## What it does
   
   `CometGenericExpressionUDF` is one `CometUDF` class that evaluates an 
arbitrary Spark `Expression` tree. At plan time, the serde registers the bound 
expression in `CometLambdaRegistry` (UUID-keyed), emits a `JvmScalarUdf` proto 
pointing at the generic class, and passes data attributes as args. At execution 
time the UDF looks up the expression, compiles it once via 
`GenerateMutableProjection`, and loops over the input Arrow vectors using a 
reused `SpecificInternalRow`.
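The register-by-UUID, compile-once, loop-per-row shape can be sketched in plain Java. This is a minimal stand-in with hypothetical names: a regex plays the role of the bound Spark `Expression`, `Pattern.compile` plays the role of the `GenerateMutableProjection` Janino compile, and a `String[]` stands in for the input Arrow vector.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Sketch of the plan-time-register / compile-once / per-row-loop shape.
// The real prototype registers a bound Spark Expression and compiles it
// with GenerateMutableProjection; here a regex stands in for both.
public class DispatcherSketch {
    // Stands in for CometLambdaRegistry: UUID key -> uncompiled expression.
    static final Map<String, String> registry = new ConcurrentHashMap<>();
    // Compile-once cache, keyed by the same registry key
    // ("one Janino compile per expression tree").
    static final Map<String, Pattern> compiled = new ConcurrentHashMap<>();

    // Plan time: register the expression; the key travels in the proto.
    static String register(String exprSource) {
        String key = UUID.randomUUID().toString();
        registry.put(key, exprSource);
        return key;
    }

    // Execution time: look up, compile on cache miss, loop over the "vector".
    static boolean[] evaluate(String key, String[] inputVector) {
        Pattern p = compiled.computeIfAbsent(key, k -> Pattern.compile(registry.get(k)));
        boolean[] out = new boolean[inputVector.length];
        for (int i = 0; i < inputVector.length; i++) {
            // Substring-match semantics, like Spark's rlike.
            out[i] = inputVector[i] != null && p.matcher(inputVector[i]).find();
        }
        return out;
    }

    public static void main(String[] args) {
        String key = register("[0-9]+");
        boolean[] r = evaluate(key, new String[] {"abc123", "xyz", null});
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // true false false
    }
}
```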
   
   ## Benefits
   
   One JVM-side class handles any scalar Spark Expression, with no 
per-expression hand-coded evaluator required.
   
   The dispatcher evaluates composed expression trees in one JNI hop. If a 
child node (e.g. `upper(col)` inside `rlike(upper(col), pattern)`) isn't 
supported natively, the whole tree still evaluates together without forcing 
whole-plan fallback.
   
   The generic path benchmarks competitively with the hand-coded `RegExpLikeUDF` 
from #4239. Spark's `MutableProjection` codegen produces the same hot-loop shape 
(bytes to `UTF8String` to eval to result) as a hand-written loop, so there is no 
inherent per-row dispatcher overhead.
   
   One Janino compile per expression tree, cached by registry key.
   
   ## Limitations
   
   ### Near-term, fixable with incremental work
   
   - Types are prototype-narrow. Input is `VarCharVector` only, output is 
`BitVector` only. Widening is mechanical: build an `Array[ColumnReader]` and a 
`ResultWriter` at cache-miss time, dispatching on Arrow type and 
`expression.dataType` once per expression. Scaladoc in the prototype file 
sketches the shape.
   - `CometLambdaRegistry` is JVM-local. Driver and executor sharing a JVM 
works for local Spark only. Cluster mode requires serializing the bound 
`Expression` into the proto (Java serialization or Kryo) and dropping the UUID 
key.
   - Only `CometRLike` is wired for the generic path in this prototype. The 
serde logic is not `RLike`-specific and can be extracted to a single helper 
that any `CometExpressionSerde` opts into with one line.
   - `Nondeterministic` expressions (`rand`, `monotonically_increasing_id`) 
need an `initialize(partitionIndex)` call before the first row. Easy to add at 
cache-miss time.
   - Registry entries are never removed, which leaks for long-running drivers.
   - `VarCharVector.get(i)` copies bytes into a fresh `byte[]` per row. Matches 
`RegExpLikeUDF`, so the A/B comparison is fair, but both paths would improve 
with a reusable `NullableVarCharHolder` or `UTF8String.fromAddress`.
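The "dispatch on type once per expression" widening in the first bullet can be illustrated with a hypothetical reader/writer pair, chosen at cache-miss time and then reused in the per-row loop. Interface and enum names here are illustrative, not the prototype's actual API; plain arrays stand in for Arrow vectors.

```java
// Hypothetical sketch of widening beyond VarCharVector-in / BitVector-out:
// pick a ColumnReader per input column and a ResultWriter for the output
// type once, at cache-miss time, then reuse both in the row loop.
public class TypeDispatchSketch {
    interface ColumnReader { Object read(int row); }
    interface ResultWriter { void write(int row, Object value); }

    enum ArrowType { VARCHAR, BIGINT, BIT }

    // Cache-miss time: choose a reader from the column's Arrow type.
    static ColumnReader readerFor(ArrowType t, Object vector) {
        switch (t) {
            case VARCHAR: return row -> ((String[]) vector)[row]; // ~ VarCharVector
            case BIGINT:  return row -> ((long[]) vector)[row];   // ~ BigIntVector
            default: throw new UnsupportedOperationException("no reader for " + t);
        }
    }

    // Cache-miss time: choose a writer from expression.dataType.
    static ResultWriter writerFor(ArrowType t, Object vector) {
        switch (t) {
            case BIT:    return (row, v) -> ((boolean[]) vector)[row] = (Boolean) v;
            case BIGINT: return (row, v) -> ((long[]) vector)[row] = (Long) v;
            default: throw new UnsupportedOperationException("no writer for " + t);
        }
    }

    public static void main(String[] args) {
        String[] in = {"a", "bb", "ccc"};
        long[] out = new long[3];
        ColumnReader r = readerFor(ArrowType.VARCHAR, in);
        ResultWriter w = writerFor(ArrowType.BIGINT, out);
        for (int i = 0; i < in.length; i++) {
            // The "expression" here is length(col), emitting BIGINT.
            w.write(i, (long) ((String) r.read(i)).length());
        }
        System.out.println(out[0] + " " + out[1] + " " + out[2]); // 1 2 3
    }
}
```

The per-row loop stays monomorphic: the type switch runs once per expression, not once per row.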
   
   ### Longer-term, may never fully reach
   
   - Aggregates, window functions, and generators do not fit the 
`CometUdfBridge` "one result vector per input, same length" contract. Each 
needs its own bridge signature.
   - Python and Pandas UDFs are reachable in principle (they are `Expression` 
subclasses). Whether the per-row socket IPC to the Python worker is cheaper 
than whole-plan fallback would need to be measured.
   - Performance parity with native Rust on expressions that emit per-row 
allocations (decimals, arrays, strings out) is unlikely. JVM boxing through 
`UnsafeRow` and `ArrayData` is inherent to the evaluation shape, whereas native 
Rust writes directly into Arrow buffers.
   - Cross-Spark-version stability of Expression serialization is fragile. 
Spark internals change between releases, and a cluster-mode implementation 
would need a compatibility story.
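The cluster-mode direction from the near-term list (Java serialization of the bound expression into the proto, dropping the UUID key) amounts to a round-trip like the following sketch. `BoundExpr` is a hypothetical stand-in for a bound Spark `Expression` (which is `Serializable`); the bytes would be embedded in the `JvmScalarUdf` proto instead of a registry key.

```java
import java.io.*;

// Sketch of serializing the bound expression itself for cluster mode,
// so the executor needs no shared JVM-local registry. BoundExpr is a
// hypothetical stand-in for a bound Spark Expression.
public class SerdeSketch {
    static class BoundExpr implements Serializable {
        final String pattern;
        final int inputOrdinal;
        BoundExpr(String pattern, int inputOrdinal) {
            this.pattern = pattern;
            this.inputOrdinal = inputOrdinal;
        }
    }

    // Driver side: bytes to embed in the proto.
    static byte[] serialize(BoundExpr e) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(e);
        }
        return bos.toByteArray();
    }

    // Executor side: reconstruct the expression from the proto bytes.
    static BoundExpr deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (BoundExpr) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BoundExpr e = deserialize(serialize(new BoundExpr("[0-9]+", 0)));
        System.out.println(e.pattern + " " + e.inputOrdinal); // [0-9]+ 0
    }
}
```

This is also where the cross-version fragility bites: Java serialization ties the bytes to the exact `Expression` class layout of the Spark version that wrote them.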
   
   ## Benchmark numbers
   
   Per-row nanoseconds, lower is better. Apple M3 Max, OpenJDK 11.0.30+7-LTS, 
macOS 26.4.1. Source: `CometRegExpBenchmark` with one extra case added for the 
generic dispatcher.
   
   | Pattern | Spark | Comet (Scan) | Comet (Exec, native Rust) | Comet (Exec, JVM hand-coded) | Comet (Exec, JVM generic) |
   |---|---|---|---|---|---|
   | character_class `[0-9]+` | 12561.0 | 10616.9 | 4764.3 | 4377.9 | 4293.4 |
   | anchored `^[0-9]` | 9077.1 | 8776.9 | 3463.7 | 3487.0 | 3384.8 |
   | alternation `abc\|def\|ghi` | 12189.4 | 11970.7 | 6837.2 | 6497.1 | 6785.3 |
   | multi_class `[a-zA-Z][0-9]+` | 9394.9 | 10048.6 | 4272.1 | 4193.9 | 4343.2 |
   | repetition `(ab){2,}` | 9160.1 | 9146.7 | 4086.7 | 4075.5 | 4125.5 |
   
   The generic path tracks the hand-coded path within a few percent across all 
five patterns. Native Rust is competitive but not dominant on these patterns, 
likely because the workload favors JIT-warmed backtracking over DFA 
construction. On adversarial patterns or non-regex expressions with tight Rust 
kernels, native would be expected to pull further ahead.
   
   I think this is a super promising direction for supporting UDFs more quickly 
(and with 100% compatibility)! Thanks @andygrove!

