zhuqi-lucas commented on issue #21290: URL: https://github.com/apache/datafusion/issues/21290#issuecomment-4170521160
Thanks for looking into this. After extensive testing, here's what I found:

**DF 51 works, DF 52 fails.** I confirmed this by reverting our codebase to the pre-DF-52-upgrade commit and running the same queries against the same data. All queries succeed on DF 51; several fail on DF 52.

However, I was unable to create a standalone MRE using vanilla DataFusion APIs that reproduces the difference. The scenarios I tested (missing non-nullable column, List inner field mismatch, nullable file data + non-nullable table schema) either fail on both versions or succeed on both.

Our system uses a custom `ExecutionPlanFactory` that creates `ParquetSource` directly and wires it through `FileScanConfigBuilder::from(base_config).with_source(new_source)`. The interaction between `SchemaAdapter::map_batch()` (DF 51) and our custom path appears to be what handled the edge cases, but I can't isolate it into a minimal example because the behavior depends on our full `FileScanTable` → `ParquetExecFactory` → `ParquetSource` pipeline.

Given that `PhysicalExprAdapterFactory` does cover the same cases in vanilla DF, I think the issue is likely in how we're wiring things up after the migration. I'll continue debugging on our side.

One concrete thing that might help other DF 52 migrators: the `replace_schema` block in `ParquetOpener` (L600-617 in `opener.rs`) does `RecordBatch::try_new_with_options(output_schema, arrays)` without any column casting; it just swaps the schema. In DF 51, `SchemaAdapter::map_batch()` called `arrow::compute::cast()` on each column before creating the RecordBatch, which handled subtle differences like List inner field metadata. If anyone has a custom setup where the output schema doesn't exactly match the physical file schema (including inner field names/nullability), the `replace_schema` block will reject it.

Thanks again for the help. I'll close this if we determine it's fully on our side.
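To make the `replace_schema` point above concrete, here is a rough sketch of the difference as I understand it. It is a toy model in plain Rust, not the real Arrow or DataFusion API: `DataType`, `Field`, `swap_schema`, `cast_then_build`, and the `item`/`element`/`tags` names are all hypothetical stand-ins, and column data is omitted entirely, so it only models the type check.

```rust
// Toy model in plain Rust. These types are hypothetical stand-ins,
// not the Arrow or DataFusion API. Column data is omitted: a "column"
// here is just the type it carries.

#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Int64,
    // A list type carries its inner field (name and nullability).
    List(Box<Field>),
}

#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

fn field(name: &str, data_type: DataType, nullable: bool) -> Field {
    Field { name: name.to_string(), data_type, nullable }
}

/// Strict path: accept the columns only if every column type equals the
/// schema type exactly, inner field names and nullability included.
fn swap_schema(schema: &[Field], columns: &[DataType]) -> Result<Vec<DataType>, String> {
    if schema.len() != columns.len() {
        return Err("column count mismatch".to_string());
    }
    for (f, c) in schema.iter().zip(columns) {
        if &f.data_type != c {
            return Err(format!("column '{}' type mismatch", f.name));
        }
    }
    Ok(columns.to_vec())
}

/// Shape equality that ignores inner field names and nullability.
fn same_shape(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::Int64, DataType::Int64) => true,
        (DataType::List(x), DataType::List(y)) => same_shape(&x.data_type, &y.data_type),
        _ => false,
    }
}

/// Cast-first path: relabel each column to the target type when the shapes
/// agree, then run the same strict check. (A real cast would also have to
/// validate the data, e.g. reject nulls for a non-nullable target.)
fn cast_then_build(schema: &[Field], columns: &[DataType]) -> Result<Vec<DataType>, String> {
    if schema.len() != columns.len() {
        return Err("column count mismatch".to_string());
    }
    let casted: Vec<DataType> = schema
        .iter()
        .zip(columns)
        .map(|(f, c)| if same_shape(&f.data_type, c) { f.data_type.clone() } else { c.clone() })
        .collect();
    swap_schema(schema, &casted)
}

/// A file column `List<item: Int64, nullable>` read against a table schema
/// that declares `List<element: Int64, non-nullable>`.
fn example() -> (Vec<Field>, Vec<DataType>) {
    let file_col = DataType::List(Box::new(field("item", DataType::Int64, true)));
    let table_ty = DataType::List(Box::new(field("element", DataType::Int64, false)));
    (vec![field("tags", table_ty, true)], vec![file_col])
}
```

In this model, `swap_schema` rejects the `example()` input because the inner field differs, while `cast_then_build` accepts it. That mirrors the behavior change described above, under the assumption that the real code paths behave analogously.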
