mbutrovich opened a new issue, #4515: URL: https://github.com/apache/datafusion-comet/issues/4515
I'm running into this in multiple PRs that try to reduce the FFI deep copies in the system (#4393, #4507) so I need to start tracking it since I keep introducing workarounds on those branches. ## Background Comet operators serialize Spark catalyst types into a proto schema. The native planner then translates that into Arrow types and expects DataFusion physical expressions to produce arrays matching those Arrow types. Several DataFusion / `datafusion-spark` functions return arrays whose Arrow type does not match what Spark catalyst declares for the equivalent Spark expression. On `main` this drift was masked accidentally on the shuffle path: the JVM-to-native FFI boundary in `CometShuffleExchangeExec` deep-copies and re-imports each batch (PR #4393's `CometArrowStream.reconcileStreamSchema`), then the consuming native `ScanExec.build_record_batch` casts the actual schema to the declared schema. PR #4507 collapses the FFI boundary on the inlined-native-child path, which removes both the re-stamp and the cast point, so the drift surfaces directly as `Invalid argument error: column types must match schema types, expected ... but found ...` from DataFusion record batch validation. PR #4507 works around the drift by inserting a `Projection(Cast)` between the child and `ShuffleWriterExec` and emitting a warning per drifting column. That is treatment of the symptom; the underlying mismatches should be fixed in the function implementations so the cast layer becomes a no-op. ## Related issue - DataFusion [apache/datafusion#22602](https://github.com/apache/datafusion/issues/22602): `width_bucket` returns `Int32` instead of `Int64`. Single-function bug, not an umbrella tracker. ## Observed mismatches | Spark expression | Spark catalyst declares | DataFusion / datafusion-spark produces | First seen in | | --- | --- | --- | --- | | `width_bucket` | `LongType` (Int64) | `Int32` | PR #4393 investigation | | `date_trunc(unit, ts)` | `TimestampType` -> `Timestamp(us, "UTC")` | `Timestamp(us)` (no timezone) | CI for PR #4507, `expressions/datetime/trunc_timestamp_dst.sql` | | `collect_set(int)` (and likely other element types) | `ArrayType(IntegerType, containsNull = false)` -> `List(non-null Int32)` | `List(nullable Int32)` | CI for PR #4507, `expressions/aggregate/collect_set.sql` | This list is not exhaustive. Each warning logged by the new alignment cast in `planner.rs` `OpStruct::ShuffleWriter` (after PR #4507) is a candidate to add here. ## How to find more 1. Run the Comet test suites with `RUST_LOG=warn` (or watch executor logs) on PR #4507 or later. 2. Grep for the warning string emitted by the alignment cast (`ShuffleWriter input schema mismatch` or whatever wording lands in PR #4507). 3. Each unique `expected ... vs actual ...` pair is an independent bug to file against the responsible function. ## Suggested next steps - File this list as a Comet umbrella issue tracking each mismatch as a sub-task. - For each entry: open or link an upstream DataFusion / `datafusion-spark` issue with a minimal reproducer and the catalyst-side declared type. - Once a function is fixed upstream and absorbed into Comet's pinned DataFusion version, the alignment cast becomes a no-op for that column and the warning stops firing. If no warnings remain in a release cycle, the alignment layer can be removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
