andygrove commented on PR #4267:
URL:
https://github.com/apache/datafusion-comet/pull/4267#issuecomment-4489343787
AI review:
- On CometScalaUDFCodegen.scala:149 (ensureKernel):
"Could you walk me through what happens to Rand and
MonotonicallyIncreasingID state when a partition has batches with different
nullability profiles? The cache key includes per-batch nullable(v) =
v.getNullCount != 0, so a nullability flip inside the partition will land
on a different activeKey, drop the existing activeKernel, and
re-init(partitionId) the new one. If batches flip back and forth,
each transition reseeds XORShiftRandom and resets the ID counter, which
could produce duplicate rand sequences or overlapping IDs. Is there a test that
pins a single partition's rand /
monotonically_increasing_id output across a nullability flip?"
- On CometScalaUDFCodegen.scala:213 (nullable(v)):
"Arrow Java's getNullCountcan return-1for vectors whose null count hasn't
been computed yet (depending on how the vector was constructed). On the FFI
import path from native, is there a
guaranteegetNullCountis materialized before we observe it here? A-1would
compare!= 0 and the kernel would compile as nullable, which is harmless. The
opposite direction (0` returned for a vector that does
have nulls) is what would worry me."
- On CometScalaUDF.scala:62 (expr.collect { case a: AttributeReference =>
a }.distinct):
"AttributeReference equality includes exprId so this should dedupe
correctly, but it might be worth a comment noting that the resulting ordinal
order is determined by tree traversal order (so the executor
side has to recompute the same bound positions from the same tree)."
- On CometBatchKernelCodegen.scala:212-218 (freshReferences):
"The freshReferences closure captures boundExpr and inputSchema and
re-runs generateSource each time. The doc says this is microseconds versus
milliseconds for Janino compile, which is fine, but if a
partition flips between two kernels often this becomes per-flip work.
Worth a follow-up TODO to cache the references array per cache entry and only
regenerate when ScalaUDF stateful encoders force it?"
- On the PR description (Spark test diffs bullet):
"The PR description mentions dev/diffs/*.diff updates totaling 198 lines,
but I don't see them in the current diff (looks like commit a15935748 rolled
them back). If Matt's earlier comment about "4 Spark
SQL test failures" still applies, those diff updates might want to come
back. If those failures resolved themselves with the rebase, the bullet in the
description can drop."
- General (no specific anchor):
"All CI green, very thorough test coverage, well-documented internal
invariants. Disabled by default makes the experimental landing low-risk. Once
the rand/MonotonicallyIncreasingID lifetime question is
settled (either with a test or with a code clarification) I think this is
good to merge and iterate from main."
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]