Blizzara opened a new pull request, #12325: URL: https://github.com/apache/datafusion/pull/12325
## Which issue does this PR close? Fixes a correctness issue when consuming Substrait plans against tables that don't exactly match in schema. Currently DF may in those cases use wrong columns due to Substrait code operating on just column _indices_, not names. Kind of addresses https://github.com/apache/datafusion/issues/12223, though this isn't exactly what the issue asks for. The difference is this PR checks that the Substrait schema either matches, _or is a subset of_, the table schema. It also fixes the correctness issue by ensuring we select only the columns Substrait expects us to have. IMO, requiring strict match of the schema is harmful, since there's no problem in reading a dataset that has more columns than what the plan needs. That requirement breaks using a once-generated Substrait plan across an evolving (but not breaking) input dataset, which seems counterproductive. Also it breaks with e.g. current substrait-spark produced plans, since those only include in the base_schema the columns that Spark considered necessary for the plan (that comes from the use of [plan.schema here](https://github.com/substrait-io/substrait-java/blob/c8c31ecd3da047a4de0c33b21f3e960df628bda0/spark/src/main/scala/io/substrait/spark/logical/ToSubstraitRel.scala#L334)). One could argue substrait-spark should be fixed, and I'll look into that, but it'll take some time as the right way then to represent this is with a projection but substrait-java doesn't support those yet. ## Rationale for this change Fixes a correctness issue when consuming Substrait against tables that don't match exactly the base_schema defined in Substrait. ## What changes are included in this PR? Adds a schema check for input tables, if the check passes all good, if not we try to select a matching subset of columns to ensure a match. If some column isn't found or has wrong type, throws. ## Are these changes tested? Existing tests ## Are there any user-facing changes? Fixes wrong and undefined behavior; working plans shouldn't be affected. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org