Re: [PR] fix(substrait): remove optimize calls from substrait consumer [datafusion]

via GitHub Wed, 09 Oct 2024 21:27:41 -0700


tokoko commented on code in PR #12800:
URL: https://github.com/apache/datafusion/pull/12800#discussion_r1794614220



##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -984,30 +984,64 @@ pub async fn from_substrait_rel(
 /// 1. All fields present in the Substrait schema are present in the 
DataFusion schema. The
 ///    DataFusion schema may have MORE fields, but not the other way around.
 /// 2. All fields are compatible. See [`ensure_field_compatability`] for 
details
-///
-/// This function returns a DataFrame with fields adjusted if necessary in the 
event that the
-/// Substrait schema is a subset of the DataFusion schema.
 fn ensure_schema_compatability(
-    table: DataFrame,
+    table_schema: DFSchema,
     substrait_schema: DFSchema,
-) -> Result<DataFrame> {
-    let df_schema = table.schema().to_owned().strip_qualifiers();
-    if df_schema.logically_equivalent_names_and_types(&substrait_schema) {
-        return Ok(table);
-    }
-    let selected_columns = substrait_schema
+) -> Result<()> {
+    substrait_schema
         .strip_qualifiers()
         .fields()
         .iter()
         .map(|substrait_field| {
             let df_field =
-                df_schema.field_with_unqualified_name(substrait_field.name())?;
-            ensure_field_compatability(df_field, substrait_field)?;
-            Ok(col(format!("\"{}\"", df_field.name())))
+                
table_schema.field_with_unqualified_name(substrait_field.name())?;
+            ensure_field_compatability(df_field, substrait_field)
         })
-        .collect::<Result<_>>()?;
+        .collect::<Result<()>>()?;
+
+    Ok(())
+}
+
+/// This function returns a DataFrame with fields adjusted if necessary in the 
event that the
+/// Substrait schema is a subset of the DataFusion schema.
+fn apply_projection(table: DataFrame, substrait_schema: DFSchema) -> 
Result<LogicalPlan> {

Review Comment:
   > this might be more scope than you intended to take and might be better as 
a follow-up
   
   I think I crossed that line a long ago, it's fine :laughing: 
   
   Can you clarify how is this different from the current flow (aside from it 
being refactored into another function)? My thought was also to eventually add 
it 
[here](https://github.com/apache/datafusion/blob/e8b2b2a695e657034824200f9f2403c01179a3fd/datafusion/substrait/src/logical_plan/consumer.rs#L769)
 before calling `apply_masking` on the schema. We can always obtain a TableScan 
from a df and mutate it. then `apply_projection` would be changed to accept 
TableScan directly instead of a DataFrame.
   
   Or maybe I'm misunderstanding your point. Are you challenging the fact that 
masking is applied to the schema first, which is then applied to a TableScan? 
The rationale there was that base_schema and masking are both needed in 
conjunction to come up with column indices and if I did in two different steps, 
I'd have to remap already added projections which seemed more error-prone.
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(substrait): remove optimize calls from substrait consumer [datafusion]

Reply via email to