mbutrovich commented on PR #4267: URL: https://github.com/apache/datafusion-comet/pull/4267#issuecomment-4492525610
Thanks @andygrove for the careful review and the AI pass, both turned up real issues. Changes pushed: - Doc: scalar UDF intro now links to Spark's scalar UDF guide and avoids the temporally fragile "no longer falls back" phrasing. - Dispatcher: stopped deriving spec nullability from per-batch `getNullCount`. The cache key is now a function of the bound expression bytes and the schema-stable input shapes only. `BoundReference.nullable` (Catalyst's schema-tracked flag, baked into the serialized bytes on the driver) is the sole source of nullability information, so schema-declared non-null columns still get full `isNullAt` elision via Spark's own `doGenCode`. - Dispatcher: replaced the single `activeKernel` slot with a kernel instance stashed in each `CacheEntry`. The instance is `init(partitionId)`'d once at compile time and reused for every batch of that `(expression, schema)`. Removed `ensureKernel`, `rewriteBoundReferences`, and the active-slot bookkeeping. The AI review caught two real correctness bugs the previous design had: - A nullability flip mid-partition (one batch has nulls, the next does not) reset the kernel and replayed any per-partition stateful counters (`MonotonicallyIncreasingID`, `Rand`'s `XORShiftRandom`). Reproduced with a 200-row single-partition parquet at batch size 8 with a null range; the new test pins the invariant. - A plan with two distinct ScalaUDFs in one operator thrashed the single active-kernel slot per batch and reset state on every flip. Reproduced with two UDFs each wrapping `monotonically_increasing_id()`; new test pins it. Confirmed the same shape also fires when one UDF is applied to two columns of differing schema nullability (different `BoundReference` ordinals produce different cache keys), covered by a third test. The other AI suggestions turned out to be moot: - `getNullCount == -1` no longer matters since we no longer read `getNullCount`. - The `freshReferences` per-flip cost is no longer a concern since the kernel is instantiated exactly once per cache entry rather than per partition-flip. - The `AttributeReference.collect.distinct` ordering note: the existing comment already documents the load-bearing invariant (ordinals align with the data args we ship), and the framing the AI suggested would have been slightly misleading about what the executor does. PR description updated to drop the stale `dev/diffs/*.diff` bullet (artifact of when this was enabled by default). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
