mbutrovich opened a new issue, #4515:
URL: https://github.com/apache/datafusion-comet/issues/4515

   I'm running into this in multiple PRs that try to reduce the FFI deep copies 
in the system (#4393, #4507) so I need to start tracking it since I keep 
introducing workarounds on those branches.
   
   ## Background
   
   Comet operators serialize Spark catalyst types into a proto schema. The 
native planner then translates that into Arrow types and expects DataFusion 
physical expressions to produce arrays matching those Arrow types. Several 
DataFusion / `datafusion-spark` functions return arrays whose Arrow type does 
not match what Spark catalyst declares for the equivalent Spark expression.
   
   On `main` this drift was masked accidentally on the shuffle path: the 
JVM-to-native FFI boundary in `CometShuffleExchangeExec` deep-copies and 
re-imports each batch (PR #4393's `CometArrowStream.reconcileStreamSchema`), 
then the consuming native `ScanExec.build_record_batch` casts the actual schema 
to the declared schema. PR #4507 collapses the FFI boundary on the 
inlined-native-child path, which removes both the re-stamp and the cast point, 
so the drift surfaces directly as `Invalid argument error: column types must 
match schema types, expected ... but found ...` from DataFusion record batch 
validation.
   
   PR #4507 works around the drift by inserting a `Projection(Cast)` between 
the child and `ShuffleWriterExec` and emitting a warning per drifting column. 
That is treatment of the symptom; the underlying mismatches should be fixed in 
the function implementations so the cast layer becomes a no-op.
   
   ## Related issue
   
   - DataFusion 
[apache/datafusion#22602](https://github.com/apache/datafusion/issues/22602): 
`width_bucket` returns `Int32` instead of `Int64`. Single-function bug, not an 
umbrella tracker.
   
   ## Observed mismatches
   
   | Spark expression | Spark catalyst declares | DataFusion / datafusion-spark 
produces | First seen in |
   | --- | --- | --- | --- |
   | `width_bucket` | `LongType` (Int64) | `Int32` | PR #4393 investigation |
   | `date_trunc(unit, ts)` | `TimestampType` -> `Timestamp(us, "UTC")` | 
`Timestamp(us)` (no timezone) | CI for PR #4507, 
`expressions/datetime/trunc_timestamp_dst.sql` |
   | `collect_set(int)` (and likely other element types) | 
`ArrayType(IntegerType, containsNull = false)` -> `List(non-null Int32)` | 
`List(nullable Int32)` | CI for PR #4507, 
`expressions/aggregate/collect_set.sql` |
   
   This list is not exhaustive. Each warning logged by the new alignment cast 
in `planner.rs` `OpStruct::ShuffleWriter` (after PR #4507) is a candidate to 
add here.
   
   ## How to find more
   
   1. Run the Comet test suites with `RUST_LOG=warn` (or watch executor logs) 
on PR #4507 or later.
   2. Grep for the warning string emitted by the alignment cast (`ShuffleWriter 
input schema mismatch` or whatever wording lands in PR #4507).
   3. Each unique `expected ... vs actual ...` pair is an independent bug to 
file against the responsible function.
   
   ## Suggested next steps
   
   - File this list as a Comet umbrella issue tracking each mismatch as a 
sub-task.
   - For each entry: open or link an upstream DataFusion / `datafusion-spark` 
issue with a minimal reproducer and the catalyst-side declared type.
   - Once a function is fixed upstream and absorbed into Comet's pinned 
DataFusion version, the alignment cast becomes a no-op for that column and the 
warning stops firing. If no warnings remain in a release cycle, the alignment 
layer can be removed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to