Re: [PR] feat(experimental): ScalaUDF and Java UDF support via Janino codegen [datafusion-comet]

via GitHub Tue, 19 May 2026 15:17:56 -0700


mbutrovich commented on PR #4267:
URL: 
https://github.com/apache/datafusion-comet/pull/4267#issuecomment-4492525610


   Thanks @andygrove for the careful review and the AI pass, both turned up 
real issues.
   
   Changes pushed:
   
   - Doc: scalar UDF intro now links to Spark's scalar UDF guide and avoids the 
temporally fragile "no longer falls back" phrasing.
   - Dispatcher: stopped deriving spec nullability from per-batch 
`getNullCount`. The cache key is now a function of the bound expression bytes 
and the schema-stable input shapes only. `BoundReference.nullable` (Catalyst's 
schema-tracked flag, baked into the serialized bytes on the driver) is the sole 
source of nullability information, so schema-declared non-null columns still 
get full `isNullAt` elision via Spark's own `doGenCode`.
   - Dispatcher: replaced the single `activeKernel` slot with a kernel instance 
stashed in each `CacheEntry`. The instance is `init(partitionId)`'d once at 
compile time and reused for every batch of that `(expression, schema)`. Removed 
`ensureKernel`, `rewriteBoundReferences`, and the active-slot bookkeeping.
   
   The AI review caught two real correctness bugs the previous design had:
   
   - A nullability flip mid-partition (one batch has nulls, the next does not) 
reset the kernel and replayed any per-partition stateful counters 
(`MonotonicallyIncreasingID`, `Rand`'s `XORShiftRandom`). Reproduced with a 
200-row single-partition parquet at batch size 8 with a null range; the new 
test pins the invariant.
   - A plan with two distinct ScalaUDFs in one operator thrashed the single 
active-kernel slot per batch and reset state on every flip. Reproduced with two 
UDFs each wrapping `monotonically_increasing_id()`; new test pins it. Confirmed 
the same shape also fires when one UDF is applied to two columns of differing 
schema nullability (different `BoundReference` ordinals produce different cache 
keys), covered by a third test.
   
   The other AI suggestions turned out to be moot:
   
   - `getNullCount == -1` no longer matters since we no longer read 
`getNullCount`.
   - The `freshReferences` per-flip cost is no longer a concern since the 
kernel is instantiated exactly once per cache entry rather than per 
partition-flip.
   - The `AttributeReference.collect.distinct` ordering note: the existing 
comment already documents the load-bearing invariant (ordinals align with the 
data args we ship), and the framing the AI suggested would have been slightly 
misleading about what the executor does.
   
   PR description updated to drop the stale `dev/diffs/*.diff` bullet (artifact 
of when this was enabled by default).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(experimental): ScalaUDF and Java UDF support via Janino codegen [datafusion-comet]

Reply via email to