Re: [PR] Fix ambiguous column names in substrait conversion as a result of literals having the same name during conversion. [datafusion]

via GitHub Thu, 04 Sep 2025 15:03:28 -0700


vbarua commented on code in PR #17299:
URL: https://github.com/apache/datafusion/pull/17299#discussion_r2323596891



##########
datafusion/substrait/tests/cases/logical_plans.rs:
##########
@@ -144,6 +144,47 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn null_literal_before_and_after_joins() -> Result<()> {
+        // Confirms that literals used before and after a join but for 
different columns
+        // are correctly handled.
+
+        // File generated with substrait-java's Isthmus:
+        // ./isthmus-cli/build/graal/isthmus --create "create table A (a int); 
create table B (a int, c int); create table C (a int, d int)" "select t.*, C.d, 
CAST(NULL AS VARCHAR) as e from (select a, CAST(NULL AS VARCHAR) as c from A 
UNION ALL select a, c from B) t LEFT JOIN C ON t.a = C.a"
+        let proto_plan =
+            
read_json("tests/testdata/test_plans/null_literals.substrait.json");

Review Comment:
   minor: I would suggest naming the test and test file based on the property 
that we're trying to test, which is disambiguation of duplicate literal names 
in plans, instead of the contents of the file. A plan named `null_literal`  
doesn't really provide much information about what we're testing.
   



##########
datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs:
##########
@@ -62,7 +62,17 @@ pub async fn from_project_rel(
                 // to transform it into a column reference
                 window_exprs.insert(e.clone());
             }
-            explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?);
+            // Since substrait removes aliases, we need to assign literals 
with a UUID alias to avoid
+            // ambiguous names when the same literal is used before and after 
a join.
+            // The name tracker will ensure that two literals in the same 
project would have
+            // unique names but, it does not ensure that if a literal column 
exists in a previous
+            // project say before a join that it is deduplicated with respect 
to those columns.
+            // See: https://github.com/apache/datafusion/pull/17299
+            let maybe_apply_alias = match e {
+                lit @ Expr::Literal(_, _) => 
lit.alias(uuid::Uuid::new_v4().to_string()),

Review Comment:
   The one thing that's a bit wonky about this is that the usage of UUIDs 
injects a little bit of randomness into the conversion from Substrait plan to 
Datafusion plan. You're already dealing with this in your tests by using your 
filter for eliding UUID values:
   ```rust
           let mut settings = insta::Settings::clone_current();
           settings.add_filter(
               r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
               "[UUID]",
           );
   ```
   
   Using UUIDs makes it easy to guarantee that names are unique, but make the 
plans less readable and can complicate testing.
   
   Figuring out a deterministic scheme for this would be nice. We could 
potentially apply the name tracker to the inputs of multi-input relations (i.e. 
JoinRel, CrossRel, SetRel). That's not as nice as the UUID solution you have 
because it would require extra handling in every multi-input relation, but it 
could potentially improve readability. I'm not wedded to this though. The UUID 
solutions works well enough and maybe in practice it won't be that much of an 
issue. If it does, we can always tweak the plan conversion later.



##########
datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs:
##########
@@ -62,7 +62,17 @@ pub async fn from_project_rel(
                 // to transform it into a column reference
                 window_exprs.insert(e.clone());
             }
-            explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?);
+            // Since substrait removes aliases, we need to assign literals 
with a UUID alias to avoid
+            // ambiguous names when the same literal is used before and after 
a join.
+            // The name tracker will ensure that two literals in the same 
project would have
+            // unique names but, it does not ensure that if a literal column 
exists in a previous
+            // project say before a join that it is deduplicated with respect 
to those columns.

Review Comment:
   To clarify my understanding of this. The name tracker guarantees name 
uniqueness in the output names of the relation it is applied to (Projects, 
Aggregates). In cases of relations that consume multiple inputs (Joins, 
Unions), if the individual inputs have names that are unique but duplicated 
between them, we get duplicate name issues in the output schema when we combine 
the input schemas?
   
   



##########
datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs:
##########
@@ -62,7 +62,17 @@ pub async fn from_project_rel(
                 // to transform it into a column reference
                 window_exprs.insert(e.clone());
             }
-            explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?);
+            // Since substrait removes aliases, we need to assign literals 
with a UUID alias to avoid
+            // ambiguous names when the same literal is used before and after 
a join.

Review Comment:
   In what context does "substrait removes aliases"? When converting from 
Datafusion to Substrait? This is a bit a of a weird statement to me because 
Substrait doesn't care about names at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix ambiguous column names in substrait conversion as a result of literals having the same name during conversion. [datafusion]

Reply via email to