Blizzara opened a new pull request, #12325:
URL: https://github.com/apache/datafusion/pull/12325

   ## Which issue does this PR close?
   
   Fixes a correctness issue when consuming Substrait plans against tables that 
don't exactly match in schema. Currently DF may in those cases use wrong 
columns due to Substrait code operating on just column _indices_, not names.
   
   Kind of addresses https://github.com/apache/datafusion/issues/12223, though 
this isn't exactly what the issue asks for.
   
   The difference is this PR checks that the Substrait schema either matches, 
_or is a subset of_, the table schema. It also fixes the correctness issue by 
ensuring we select only the columns Substrait expects us to have.
   
   IMO, requiring strict match of the schema is harmful, since there's no 
problem in reading a dataset that has more columns than what the plan needs. 
That requirement breaks using a once-generated Substrait plan across an 
evolving (but not breaking) input dataset, which seems counterproductive. 
   
   Also it breaks with e.g. current substrait-spark produced plans, since those 
only include in the base_schema the columns that Spark considered necessary for 
the plan (that comes from the use of [plan.schema 
here](https://github.com/substrait-io/substrait-java/blob/c8c31ecd3da047a4de0c33b21f3e960df628bda0/spark/src/main/scala/io/substrait/spark/logical/ToSubstraitRel.scala#L334)).
 One could argue substrait-spark should be fixed, and I'll look into that, but 
it'll take some time as the right way then to represent this is with a 
projection but substrait-java doesn't support those yet.
   
   ## Rationale for this change
   
   Fixes a correctness issue when consuming Substrait against tables that don't 
match exactly the base_schema defined in Substrait.
   
   ## What changes are included in this PR?
   
   Adds a schema check for input tables, if the check passes all good, if not 
we try to select a matching subset of columns to ensure a match. If some column 
isn't found or has wrong type, throws.
   
   ## Are these changes tested?
   
   Existing tests
   
   ## Are there any user-facing changes?
   
   Fixes wrong and undefined behavior; working plans shouldn't be affected.
   
   <!--
   If there are user-facing changes then we may require documentation to be 
updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api 
change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to