zhengruifeng opened a new pull request, #51409:
URL: https://github.com/apache/spark/pull/51409

   ### What changes were proposed in this pull request?
   Add column name validation on the Connect server side.
   
   ### Why are the changes needed?
   For `df.col('bad_column')` or `df['bad_column']`, the column name is currently validated as follows:
   1. classic (both Python and Scala): eager validation;
   2. connect (Python client): eager validation against the cached schema, which may trigger an RPC;
   3. connect (Scala client): no eager validation; the query is expected to fail in the subsequent analysis or execution, but it may hit an edge case like the following:
   
   ```
   val df1 = sql("select * from values(1, 'y') as t1(a, y)")
   val df2 = sql("select * from values(1, 'x') as t2(a, x)")
   val df3 = df1.join(df2, df1("a") === df2("a"))
   val df4 = df3.select(df1("x"))
   ```
   
   `df1` does not contain a column `x` at all, so `df4` should fail to analyze; instead, the query actually succeeds with `col("x")`.
   This is due to a connect-specific column resolution fallback: in some cases, if an attribute fails to resolve with its plan id (`df1("x")`), it is resolved again without the plan id (`col("x")`).
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the problematic query above should now fail.
   
   
   ### How was this patch tested?
   Added unit tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   

