vbarua commented on code in PR #17299: URL: https://github.com/apache/datafusion/pull/17299#discussion_r2323596891
########## datafusion/substrait/tests/cases/logical_plans.rs: ########## @@ -144,6 +144,47 @@ mod tests { Ok(()) } + #[tokio::test] + async fn null_literal_before_and_after_joins() -> Result<()> { + // Confirms that literals used before and after a join but for different columns + // are correctly handled. + + // File generated with substrait-java's Isthmus: + // ./isthmus-cli/build/graal/isthmus --create "create table A (a int); create table B (a int, c int); create table C (a int, d int)" "select t.*, C.d, CAST(NULL AS VARCHAR) as e from (select a, CAST(NULL AS VARCHAR) as c from A UNION ALL select a, c from B) t LEFT JOIN C ON t.a = C.a" + let proto_plan = + read_json("tests/testdata/test_plans/null_literals.substrait.json"); Review Comment: minor: I would suggest naming the test and test file based on the property that we're trying to test, which is disambiguation of duplicate literal names in plans, instead of the contents of the file. A plan named `null_literal` doesn't really provide much information about what we're testing. ########## datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs: ########## @@ -62,7 +62,17 @@ pub async fn from_project_rel( // to transform it into a column reference window_exprs.insert(e.clone()); } - explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?); + // Since substrait removes aliases, we need to assign literals with a UUID alias to avoid + // ambiguous names when the same literal is used before and after a join. + // The name tracker will ensure that two literals in the same project would have + // unique names but, it does not ensure that if a literal column exists in a previous + // project say before a join that it is deduplicated with respect to those columns. + // See: https://github.com/apache/datafusion/pull/17299 + let maybe_apply_alias = match e { + lit @ Expr::Literal(_, _) => lit.alias(uuid::Uuid::new_v4().to_string()), Review Comment: The one thing that's a bit wonky about this is that the usage of UUIDs injects a little bit of randomness into the conversion from Substrait plan to Datafusion plan. You're already dealing with this in your tests by using your filter for eliding UUID values: ```rust let mut settings = insta::Settings::clone_current(); settings.add_filter( r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", "[UUID]", ); ``` Using UUIDs makes it easy to guarantee that names are unique, but make the plans less readable and can complicate testing. Figuring out a deterministic scheme for this would be nice. We could potentially apply the name tracker to the inputs of multi-input relations (i.e. JoinRel, CrossRel, SetRel). That's not as nice as the UUID solution you have because it would require extra handling in every multi-input relation, but it could potentially improve readability. I'm not wedded to this though. The UUID solutions works well enough and maybe in practice it won't be that much of an issue. If it does, we can always tweak the plan conversion later. ########## datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs: ########## @@ -62,7 +62,17 @@ pub async fn from_project_rel( // to transform it into a column reference window_exprs.insert(e.clone()); } - explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?); + // Since substrait removes aliases, we need to assign literals with a UUID alias to avoid + // ambiguous names when the same literal is used before and after a join. + // The name tracker will ensure that two literals in the same project would have + // unique names but, it does not ensure that if a literal column exists in a previous + // project say before a join that it is deduplicated with respect to those columns. Review Comment: To clarify my understanding of this. The name tracker guarantees name uniqueness in the output names of the relation it is applied to (Projects, Aggregates). In cases of relations that consume multiple inputs (Joins, Unions), if the individual inputs have names that are unique but duplicated between them, we get duplicate name issues in the output schema when we combine the input schemas? ########## datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs: ########## @@ -62,7 +62,17 @@ pub async fn from_project_rel( // to transform it into a column reference window_exprs.insert(e.clone()); } - explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?); + // Since substrait removes aliases, we need to assign literals with a UUID alias to avoid + // ambiguous names when the same literal is used before and after a join. Review Comment: In what context does "substrait removes aliases"? When converting from Datafusion to Substrait? This is a bit a of a weird statement to me because Substrait doesn't care about names at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org