zhuqi-lucas commented on issue #21290:
URL: https://github.com/apache/datafusion/issues/21290#issuecomment-4170521160

   Thanks for looking into this. After extensive testing, here's what I found:
   
   **DF 51 works, DF 52 fails** — confirmed by reverting our codebase to the 
pre-DF-52-upgrade commit and running the same queries against the same data. 
All queries succeed on DF 51, several fail on DF 52.
   
   However, I was unable to create a standalone MRE using vanilla DataFusion 
APIs that reproduces the difference. The scenarios I tested (missing 
non-nullable column, List inner field mismatch, nullable file data + 
non-nullable table schema) either fail on both versions or succeed on both.
   
   Our system uses a custom `ExecutionPlanFactory` that creates `ParquetSource` 
directly and wires it through 
`FileScanConfigBuilder::from(base_config).with_source(new_source)`. The 
interaction between `SchemaAdapter::map_batch()` (DF 51) and our custom path 
appears to be what handled the edge cases — but I can't isolate it into a 
minimal example because the behavior depends on our full `FileScanTable` → 
`ParquetExecFactory` → `ParquetSource` pipeline.
   
   Given that `PhysicalExprAdapterFactory` does cover the same cases in vanilla 
DF, I think the issue is likely in how we're wiring things up after the 
migration. I'll continue debugging on our side.
   
   One concrete thing that might help other DF 52 migrators: the 
`replace_schema` block in `ParquetOpener` (L600-617 in `opener.rs`) calls 
`RecordBatch::try_new_with_options(output_schema, arrays)` without casting any 
columns; it only swaps the schema. In DF 51, `SchemaAdapter::map_batch()` 
called `arrow::compute::cast()` on each column before creating the RecordBatch, 
which absorbed subtle differences such as List inner field metadata. If anyone 
has a custom setup where the output schema doesn't exactly match the physical 
file schema (including inner field names/nullability), the `replace_schema` 
block will return an error instead of adapting the batch.
   
   Thanks again for the help. I'll close this if we determine it's fully on our 
side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

