Re: [PR] [SPARK-51067][SQL] Revert session level collation for DML queries and apply object level collation for DDL queries [spark]

via GitHub Wed, 12 Feb 2025 08:35:48 -0800


dejankrak-db commented on code in PR #49772:
URL: https://github.com/apache/spark/pull/49772#discussion_r1953022318



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDDLCommandStringTypes.scala:
##########
@@ -155,22 +123,22 @@ object ResolveDefaultStringTypes extends Rule[LogicalPlan] {
     dataType.existsRecursively(isDefaultStringType)
 
   private def isDefaultStringType(dataType: DataType): Boolean = {
+    // STRING (without explicit collation) is considered default string type.
+    // STRING COLLATE <collation_name> (with explicit collation) is not considered
+    // default string type even when explicit collation is UTF8_BINARY (default collation).
     dataType match {
-      case st: StringType =>
-        // should only return true for StringType object and not StringType("UTF8_BINARY")
-        st.eq(StringType) || st.isInstanceOf[TemporaryStringType]
+      // should only return true for StringType object and not for StringType("UTF8_BINARY")
+      case st: StringType => st.eq(StringType)

Review Comment:
   I changed the line above to case StringType => true, as suggested, and this caused the test "create/alter table with table level collation" in DefaultCollationTestSuite to fail at line 88, on the following check:
   assertTableColumnCollation(testTable, "c3", "UTF8_BINARY")
   
   The check expected the data type of column c3 to be StringType(UTF8_BINARY) (or simply StringType), as set explicitly by the c3 STRING COLLATE UTF8_BINARY clause. The actual collation was UTF8_LCASE, because the specified table level collation had replaced it.
   
   This is because case st: StringType => st.eq(StringType) returns false for StringType(UTF8_BINARY), presumably because eq is a reference check under which StringType and StringType(UTF8_BINARY) are not the same object. With case StringType => true, the match returns true for StringType(UTF8_BINARY) as well, so this column's string type is replaced by the table level collation string type StringType(UTF8_LCASE), which is incorrect behavior per the ref spec.
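
   To illustrate the difference, here is a minimal, self-contained sketch that does not use Spark. Str is a hypothetical stand-in for StringType, and it assumes (as the test failure suggests, but I have not confirmed in the source) that equality is defined on the collation, so a case Str pattern compares with == while eq compares references. Demo and both helper names are made up for the example.

```scala
// Hypothetical stand-in for Spark's StringType; NOT the real class.
class Str(val collation: String) {
  // Assumed value equality: instances are equal when collations match.
  override def equals(other: Any): Boolean = other match {
    case s: Str => s.collation == collation
    case _      => false
  }
  override def hashCode(): Int = collation.hashCode
}

// Singleton standing in for plain STRING (no explicit collation).
object Str extends Str("UTF8_BINARY") {
  def apply(collation: String): Str = new Str(collation)
}

object Demo {
  // Reference check: true only for the singleton object itself.
  def isDefaultByRef(s: Str): Boolean = s.eq(Str)

  // Stable-identifier pattern: matches any value v with Str == v,
  // so an explicitly constructed Str("UTF8_BINARY") matches as well.
  def isDefaultByPattern(s: Any): Boolean = s match {
    case Str => true
    case _   => false
  }
}
```

   Under these assumptions, Demo.isDefaultByRef(Str("UTF8_BINARY")) is false while Demo.isDefaultByPattern(Str("UTF8_BINARY")) is true, which mirrors why the explicit c3 collation was overwritten once the reference check was dropped.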
   
   Taking a step back, though, I feel this is a relatively minor detail. Provided that the implementation works correctly per the ref spec (including edge cases), we should not block on this. Let me know what you think, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

