Jefffrey commented on code in PR #5509:
URL: https://github.com/apache/arrow-datafusion/pull/5509#discussion_r1128477085


##########
datafusion/common/src/column.rs:
##########
@@ -154,6 +158,105 @@ impl Column {
                 .collect(),
         }))
     }
+
+    /// Qualify column if not done yet.
+    ///
+    /// If this column already has a [relation](Self::relation), it will be returned as is and the given
+    /// parameters are ignored. Otherwise this will search through the given schemas to find the column.
+    ///
+    /// Will check for ambiguity at each level of `schemas`.
+    ///
+    /// A schema matches if there is a single column that -- when unqualified -- matches this column. There
+    /// is an exception for `USING` statements, see below.
+    ///
+    /// # Using columns
+    /// Take the following SQL statement:
+    ///
+    /// ```sql
+    /// SELECT id FROM t1 JOIN t2 USING(id)
+    /// ```
+    ///
+    /// In this case, both `t1.id` and `t2.id` will match unqualified column `id`. To express this
+    /// possibility, use `using_columns`. Each entry in this array is a set of columns that are bound
+    /// together via a `USING` clause. So in this example this would be `[{t1.id, t2.id}]`.
+    ///
+    /// Regarding the ambiguity check, `schemas` is structured to allow levels of schemas to be passed in.
+    /// For example:
+    ///
+    /// ```text
+    /// schemas = &[
+    ///    &[schema1, schema2], // first level
+    ///    &[schema3, schema4], // second level
+    /// ]
+    /// ```
+    ///
+    /// This will search for a matching field in all schemas in the first level. If no matching field is
+    /// found according to the above-mentioned conditions, the next level is checked. If more than one
+    /// matching column (that is not a USING column) is found across all schemas in a level, an
+    /// ambiguous-column error is returned.
+    ///
+    /// If all levels have been checked and no field was found, a field-not-found error is returned.
+    pub fn normalize_with_schemas_and_ambiguity_check(
+        self,
+        schemas: &[&[&DFSchema]],

Review Comment:
   Kinda ugly syntax here. It makes the logic work, but maybe we could have something better?
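
   To illustrate the level-by-level lookup the doc comment describes, here is a stdlib-only Rust sketch that mimics the search over `&[&[&DFSchema]]`. Schemas are modeled as plain slices of field names, and `find_field` is a hypothetical name; the real method additionally handles qualifiers and `using_columns`, so this is only a sketch of the nesting, not DataFusion's implementation.

   ```rust
   /// Search `levels` (level -> schemas -> field names) for `name`.
   /// Stops at the first level containing a match; more than one match
   /// within a single level is an ambiguity error.
   fn find_field<'a>(levels: &[&[&[&'a str]]], name: &str) -> Result<&'a str, String> {
       for schemas in levels {
           // Collect every match within this level, across all its schemas.
           let matches: Vec<&'a str> = schemas
               .iter()
               .flat_map(|schema| schema.iter().copied())
               .filter(|field| *field == name)
               .collect();
           match matches.len() {
               0 => continue, // nothing at this level, fall through to the next
               1 => return Ok(matches[0]),
               _ => return Err(format!("ambiguous column: {name}")),
           }
       }
       Err(format!("field not found: {name}"))
   }
   ```

   So with `&[&[s1, s2], &[s3]]`, a unique match among `s1`/`s2` wins before `s3` is ever consulted, which is exactly why the parameter ends up triple-nested.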



##########
datafusion/common/src/column.rs:
##########
@@ -102,6 +102,10 @@ impl Column {
     /// In this case, both `t1.id` and `t2.id` will match unqualified column `id`. To express this
     /// possibility, use `using_columns`. Each entry in this array is a set of columns that are bound
     /// together via a `USING` clause. So in this example this would be `[{t1.id, t2.id}]`.
+    #[deprecated(
+        since = "20.0.0",
+        note = "use normalize_with_schemas_and_ambiguity_check instead"
+    )]

Review Comment:
   I've marked the old methods as deprecated, as I'm unsure if they should be removed (technically a breaking API change?).
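
   For reference, a minimal sketch of how the `#[deprecated]` attribute behaves; the `Column`/`normalize` names below are placeholders, not DataFusion's actual API. Callers still compile, but rustc emits a warning at every call site pointing at the replacement named in `note`.

   ```rust
   struct Column;

   impl Column {
       /// Old entry point, kept for compatibility. Calling it still works,
       /// but rustc warns and surfaces the `note` text to the caller.
       #[deprecated(
           since = "20.0.0",
           note = "use normalize_with_schemas_and_ambiguity_check instead"
       )]
       fn normalize(&self) -> &'static str {
           "normalized via the old path"
       }
   }
   ```

   This is why deprecation is the softer option here: downstream code keeps compiling for a release cycle instead of breaking immediately on upgrade.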



##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -505,10 +505,14 @@ pub async fn from_substrait_rex(
                         "Direct reference StructField with child is not supported"
                             .to_string(),
                     )),
-                    None => Ok(Arc::new(Expr::Column(Column {
-                        relation: None,
-                        name: input_schema.field(x.field as usize).name().to_string(),
-                    }))),
+                    None => {
+                        let column =
+                            input_schema.field(x.field as usize).qualified_column();
+                        Ok(Arc::new(Expr::Column(Column {
+                            relation: column.relation,
+                            name: column.name,
+                        })))
+                    }

Review Comment:
   This is a separate Substrait fix, since the code previously wasn't qualifying the column being built even when the underlying field itself was qualified.
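
   A toy model of what the fix changes, using made-up `Field`/`Column` types that only mirror the shape of DataFusion's: the column is built from the field's qualifier rather than hard-coding `relation: None`, so a qualifier like `t1` survives the conversion.

   ```rust
   struct Column {
       relation: Option<String>,
       name: String,
   }

   struct Field {
       qualifier: Option<String>,
       name: String,
   }

   impl Field {
       /// Preserve the field's qualifier when turning it into a column,
       /// instead of always producing an unqualified column.
       fn qualified_column(&self) -> Column {
           Column {
               relation: self.qualifier.clone(),
               name: self.name.clone(),
           }
       }
   }
   ```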



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -174,7 +174,26 @@ impl LogicalPlan {
         }
     }
 
+    /// Used for normalizing columns, as the fallback schemas to the main schema
+    /// of the plan.
+    pub fn fallback_normalize_schemas(&self) -> Vec<&DFSchema> {
+        match self {
+            LogicalPlan::Window(_)
+            | LogicalPlan::Projection(_)
+            | LogicalPlan::Aggregate(_)
+            | LogicalPlan::Unnest(_)
+            | LogicalPlan::Join(_)
+            | LogicalPlan::CrossJoin(_) => self
+                .inputs()
+                .iter()
+                .map(|input| input.schema().as_ref())
+                .collect(),
+            _ => vec![],
+        }
+    }

Review Comment:
   Since the previous column normalization depended on `all_schemas`, which I still don't fully understand, this method extracts the part of `all_schemas` that returns self & the child schemas, but only returns the child schemas.
   
   Meant to be used as a fallback to the new normalization method.
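
   A simplified stand-in for this behavior, with plan nodes as a toy enum and schemas as plain field-name slices (the names are illustrative, not DataFusion's): only node kinds whose output schema can rename or drop input columns expose their children's schemas as a fallback, everything else returns nothing.

   ```rust
   enum Plan<'a> {
       // Stands in for Window/Projection/Aggregate/Join etc., whose output
       // schema may differ from their inputs' schemas.
       Projection { inputs: Vec<&'a [&'a str]> },
       // Stands in for every other node, which passes columns through.
       Filter { inputs: Vec<&'a [&'a str]> },
   }

   impl<'a> Plan<'a> {
       /// Child schemas to consult only after the node's own schema
       /// failed to resolve a column.
       fn fallback_normalize_schemas(&self) -> Vec<&'a [&'a str]> {
           match self {
               Plan::Projection { inputs } => inputs.clone(),
               Plan::Filter { .. } => vec![],
           }
       }
   }
   ```

   The point of the split into levels is that a column resolved by the node's own schema never consults the children, so the fallback cannot introduce spurious ambiguity.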



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to