Re: [PR] feat(experimental): ScalaUDF and Java UDF support via Janino codegen [datafusion-comet]

via GitHub Tue, 19 May 2026 08:29:01 -0700


andygrove commented on PR #4267:
URL: 
https://github.com/apache/datafusion-comet/pull/4267#issuecomment-4489343787


   AI review:
   
   - On CometScalaUDFCodegen.scala:149 (ensureKernel):
     "Could you walk me through what happens to Rand and 
MonotonicallyIncreasingID state when a partition has batches with different 
nullability profiles? The cache key includes per-batch nullable(v) = 
     v.getNullCount != 0, so a nullability flip inside the partition will land 
on a different activeKey, drop the existing activeKernel, and 
re-init(partitionId) the new one. If batches flip back and forth,
     each transition reseeds XORShiftRandom and resets the ID counter, which 
could produce duplicate rand sequences or overlapping IDs. Is there a test that 
pins a single partition's rand /
     monotonically_increasing_id output across a nullability flip?"
     - On CometScalaUDFCodegen.scala:213 (nullable(v)): 
     "Arrow Java's getNullCountcan return-1for vectors whose null count hasn't 
been computed yet (depending on how the vector was constructed). On the FFI 
import path from native, is there a 
     guaranteegetNullCountis materialized before we observe it here? A-1would 
compare!= 0 and the kernel would compile as nullable, which is harmless. The 
opposite direction (0` returned for a vector that does
      have nulls) is what would worry me."
     - On CometScalaUDF.scala:62 (expr.collect { case a: AttributeReference => 
a }.distinct):
     "AttributeReference equality includes exprId so this should dedupe 
correctly, but it might be worth a comment noting that the resulting ordinal 
order is determined by tree traversal order (so the executor
      side has to recompute the same bound positions from the same tree)."
     - On CometBatchKernelCodegen.scala:212-218 (freshReferences):
     "The freshReferences closure captures boundExpr and inputSchema and 
re-runs generateSource each time. The doc says this is microseconds versus 
milliseconds for Janino compile, which is fine, but if a
     partition flips between two kernels often this becomes per-flip work. 
Worth a follow-up TODO to cache the references array per cache entry and only 
regenerate when ScalaUDF stateful encoders force it?"
     - On the PR description (Spark test diffs bullet):
     "The PR description mentions dev/diffs/*.diff updates totaling 198 lines, 
but I don't see them in the current diff (looks like commit a15935748 rolled 
them back). If Matt's earlier comment about "4 Spark
     SQL test failures" still applies, those diff updates might want to come 
back. If those failures resolved themselves with the rebase, the bullet in the 
description can drop."
     - General (no specific anchor):
     "All CI green, very thorough test coverage, well-documented internal 
invariants. Disabled by default makes the experimental landing low-risk. Once 
the rand/MonotonicallyIncreasingID lifetime question is
     settled (either with a test or with a code clarification) I think this is 
good to merge and iterate from main."
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(experimental): ScalaUDF and Java UDF support via Janino codegen [datafusion-comet]

Reply via email to