[ 
https://issues.apache.org/jira/browse/SPARK-57223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated SPARK-57223:
------------------------------
    Description: 
*Symptom*

Error messages involving a column or struct field whose name contains a dot or 
space display the identifier incorrectly. For example, for a column named 
{{{}job.title{}}}:
{noformat}
Fail to assign a value of "INT" type to the "SMALLINT" type column or variable 
`job`.`title`
{noformat}
Expected:
{noformat}
Fail to assign a value of "INT" type to the "SMALLINT" type column or variable 
`job.title`
{noformat}
*Root Cause*

{{DataTypeErrorsBase.toSQLId(parts: String)}} passes its input through 
{{{}AttributeNameParser.parseAttributeName{}}}, which interprets the string as 
SQL text and splits on unquoted dots:
{code:scala}
def toSQLId(parts: String): String =
  toSQLId(AttributeNameParser.parseAttributeName(parts))  // "job.title" -> 
Seq("job","title")
{code}
The String overload is intended for user-typed SQL identifier text. However, 
many call sites pass verbatim stored names from the internal object model – 
{{{}toSQLId(col.name){}}}, {{{}toSQLId(attr.name){}}}, {{{}toSQLId(f.name){}}}, 
etc. – where the string is a raw single-part name that may contain a dot or 
space.

*Fix option #1:* change the String overload body to 
{{{}toSQLId(Seq(parts)){}}}. All verbatim-name call sites are fixed 
automatically. However, this approach touches all toSQLId callers which is 
quite invasive and risky.

*Some pre-existing Misuse of toSQLId(String) (E1-E6)*

Six call sites pass already-formatted SQL strings and currently work by 
accident. They must be corrected before the overload change, or addressed as a 
follow-up. All are in error-message-only paths – no execution risk.
|| #  ||Location||Argument||Why it is wrong||Correct fix||
|E1|CheckAnalysis.scala:441,446,1213,1221,1236,1239,1242,1246,1251,1266|table.name
 / r.name|ResolvedTable.name is already a pre-formatted string|Drop toSQLId; 
use table.name directly|
|E2|CheckAnalysis.scala:875,883|column.quoted|Array[String] pre-joined to a 
formatted string before passing|Store as column.toImmutableArraySeq so toSQLId 
uses the Seq overload|
|E3|CheckAnalysis.scala:~1227|col.name.quoted|Seq[String] pre-formatted; 
fieldName also used as bare String at line 1277 (compilation coupling)|Use 
col.name (Seq[String]) for toSQLId calls; keep col.name.quoted at line 1277|
|E4|QueryCompilationErrors.scala:728,1318|FunctionIdentifier.toString|Unquoted 
but dot-joined; mis-splits if name contains a dot|Use x.nameParts|
|E5|QueryCompilationErrors.scala:2982,2994|TableIdentifier.toString|Fully 
backtick-quoted (inherits quotedString); would double-quote|Use 
tableIdent.nameParts|
|E6|JDBCTable.scala:84,107|ident.toString|Conditionally-quoted dot-joined 
multi-part JDBC name|Use toSQLId(ident.namespace().toSeq :+ ident.name())|

{*}Fix option #2{*}: surgical fix which only change how toSQLId is used for 
user-facing error messages. Scope/Risk is limited. See 
[https://github.com/apache/spark/pull/56351] 

  was:
*Symptom*

Error messages involving a column or struct field whose name contains a dot or 
space display the identifier incorrectly. For example, for a column named 
{{{}job.title{}}}:
{noformat}
Fail to assign a value of "INT" type to the "SMALLINT" type column or variable 
`job`.`title`
{noformat}
Expected:
{noformat}
Fail to assign a value of "INT" type to the "SMALLINT" type column or variable 
`job.title`
{noformat}
*Root Cause*

{{DataTypeErrorsBase.toSQLId(parts: String)}} passes its input through 
{{{}AttributeNameParser.parseAttributeName{}}}, which interprets the string as 
SQL text and splits on unquoted dots:
{code:scala}
def toSQLId(parts: String): String =
  toSQLId(AttributeNameParser.parseAttributeName(parts))  // "job.title" -> 
Seq("job","title")
{code}
The String overload is intended for user-typed SQL identifier text. However, 
many call sites pass verbatim stored names from the internal object model – 
{{{}toSQLId(col.name){}}}, {{{}toSQLId(attr.name){}}}, {{{}toSQLId(f.name){}}}, 
etc. – where the string is a raw single-part name that may contain a dot or 
space.

*Fix option #1:* change the String overload body to 
{{{}toSQLId(Seq(parts)){}}}. All verbatim-name call sites are fixed 
automatically. 

*Some pre-existing Misuse of toSQLId(String) (E1-E6)*

Six call sites pass already-formatted SQL strings and currently work by 
accident. They must be corrected before the overload change, or addressed as a 
follow-up. All are in error-message-only paths – no execution risk.
|| #  ||Location||Argument||Why it is wrong||Correct fix||
|E1|CheckAnalysis.scala:441,446,1213,1221,1236,1239,1242,1246,1251,1266|table.name
 / r.name|ResolvedTable.name is already a pre-formatted string|Drop toSQLId; 
use table.name directly|
|E2|CheckAnalysis.scala:875,883|column.quoted|Array[String] pre-joined to a 
formatted string before passing|Store as column.toImmutableArraySeq so toSQLId 
uses the Seq overload|
|E3|CheckAnalysis.scala:~1227|col.name.quoted|Seq[String] pre-formatted; 
fieldName also used as bare String at line 1277 (compilation coupling)|Use 
col.name (Seq[String]) for toSQLId calls; keep col.name.quoted at line 1277|
|E4|QueryCompilationErrors.scala:728,1318|FunctionIdentifier.toString|Unquoted 
but dot-joined; mis-splits if name contains a dot|Use x.nameParts|
|E5|QueryCompilationErrors.scala:2982,2994|TableIdentifier.toString|Fully 
backtick-quoted (inherits quotedString); would double-quote|Use 
tableIdent.nameParts|
|E6|JDBCTable.scala:84,107|ident.toString|Conditionally-quoted dot-joined 
multi-part JDBC name|Use toSQLId(ident.namespace().toSeq :+ ident.name())|

Fix option #2: surgical fix which only change how toSQLId is used for 
user-facing error messages. Scope/Risk is limited. See 
[https://github.com/apache/spark/pull/56351] 


> toSQLId(String) misrepresents column/field names containing dots or spaces in 
> error messages
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57223
>                 URL: https://issues.apache.org/jira/browse/SPARK-57223
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 5.0.0
>            Reporter: Eric Yang
>            Priority: Major
>              Labels: pull-request-available
>
> *Symptom*
> Error messages involving a column or struct field whose name contains a dot 
> or space display the identifier incorrectly. For example, for a column named 
> {{{}job.title{}}}:
> {noformat}
> Fail to assign a value of "INT" type to the "SMALLINT" type column or 
> variable `job`.`title`
> {noformat}
> Expected:
> {noformat}
> Fail to assign a value of "INT" type to the "SMALLINT" type column or 
> variable `job.title`
> {noformat}
> *Root Cause*
> {{DataTypeErrorsBase.toSQLId(parts: String)}} passes its input through 
> {{{}AttributeNameParser.parseAttributeName{}}}, which interprets the string 
> as SQL text and splits on unquoted dots:
> {code:scala}
> def toSQLId(parts: String): String =
>   toSQLId(AttributeNameParser.parseAttributeName(parts))  // "job.title" -> 
> Seq("job","title")
> {code}
> The String overload is intended for user-typed SQL identifier text. However, 
> many call sites pass verbatim stored names from the internal object model – 
> {{{}toSQLId(col.name){}}}, {{{}toSQLId(attr.name){}}}, 
> {{{}toSQLId(f.name){}}}, etc. – where the string is a raw single-part name 
> that may contain a dot or space.
> *Fix option #1:* change the String overload body to 
> {{{}toSQLId(Seq(parts)){}}}. All verbatim-name call sites are fixed 
> automatically. However, this approach touches all toSQLId callers which is 
> quite invasive and risky.
> *Some pre-existing Misuse of toSQLId(String) (E1-E6)*
> Six call sites pass already-formatted SQL strings and currently work by 
> accident. They must be corrected before the overload change, or addressed as 
> a follow-up. All are in error-message-only paths – no execution risk.
> || #  ||Location||Argument||Why it is wrong||Correct fix||
> |E1|CheckAnalysis.scala:441,446,1213,1221,1236,1239,1242,1246,1251,1266|table.name
>  / r.name|ResolvedTable.name is already a pre-formatted string|Drop toSQLId; 
> use table.name directly|
> |E2|CheckAnalysis.scala:875,883|column.quoted|Array[String] pre-joined to a 
> formatted string before passing|Store as column.toImmutableArraySeq so 
> toSQLId uses the Seq overload|
> |E3|CheckAnalysis.scala:~1227|col.name.quoted|Seq[String] pre-formatted; 
> fieldName also used as bare String at line 1277 (compilation coupling)|Use 
> col.name (Seq[String]) for toSQLId calls; keep col.name.quoted at line 1277|
> |E4|QueryCompilationErrors.scala:728,1318|FunctionIdentifier.toString|Unquoted
>  but dot-joined; mis-splits if name contains a dot|Use x.nameParts|
> |E5|QueryCompilationErrors.scala:2982,2994|TableIdentifier.toString|Fully 
> backtick-quoted (inherits quotedString); would double-quote|Use 
> tableIdent.nameParts|
> |E6|JDBCTable.scala:84,107|ident.toString|Conditionally-quoted dot-joined 
> multi-part JDBC name|Use toSQLId(ident.namespace().toSeq :+ ident.name())|
> {*}Fix option #2{*}: surgical fix which only change how toSQLId is used for 
> user-facing error messages. Scope/Risk is limited. See 
> [https://github.com/apache/spark/pull/56351] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to