Re: [PR] [SPARK-51067][SQL] Revert session level collation for DML queries and apply object level collation for DDL queries [spark]

via GitHub Wed, 12 Feb 2025 07:58:12 -0800


dejankrak-db commented on code in PR #49772:
URL: https://github.com/apache/spark/pull/49772#discussion_r1952921268



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDDLCommandStringTypes.scala:
##########
@@ -155,22 +123,22 @@ object ResolveDefaultStringTypes extends Rule[LogicalPlan] {
     dataType.existsRecursively(isDefaultStringType)
 
   private def isDefaultStringType(dataType: DataType): Boolean = {
+    // STRING (without explicit collation) is considered default string type.
+    // STRING COLLATE <collation_name> (with explicit collation) is not considered
+    // default string type even when explicit collation is UTF8_BINARY (default collation).
     dataType match {
-      case st: StringType =>
-        // should only return true for StringType object and not StringType("UTF8_BINARY")
-        st.eq(StringType) || st.isInstanceOf[TemporaryStringType]
+      // should only return true for StringType object and not for StringType("UTF8_BINARY")
+      case st: StringType => st.eq(StringType)
       case _ => false
     }
   }
 
   private def replaceDefaultStringType(dataType: DataType, newType: StringType): DataType = {
+    // Should replace STRING with the new type.
+    // Should not replace STRING COLLATE UTF8_BINARY, as that is explicit collation.
     dataType.transformRecursively {
       case currentType: StringType if isDefaultStringType(currentType) =>
-        if (currentType == newType) {
-          TemporaryStringType()
-        } else {
-          newType
-        }
+        newType

Review Comment:
   We don't need RuleExecutor.forceAdditionalIteration anymore, so I have removed it from the code altogether, as also suggested in the other comment.
   
   If newType is StringType(UTF8_BINARY), that won't be an issue: the only candidates for replacement are default string types, i.e. StringType, whose collation is UTF8_BINARY by default. Hence, even if we skip them at that point, their collation remains accurate.
   If the object-level collation is later changed to a non-UTF8_BINARY collation (e.g. ALTER TABLE foo DEFAULT COLLATION UNICODE), the change applies only to columns added from that point onwards. Existing columns remain unaffected (per the ref spec), because their collation has already been stamped. The only way to change it is ALTER TABLE foo ALTER COLUMN c1 STRING COLLATE UNICODE, which is handled separately through the existing column-level collation logic.
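
   To illustrate the distinction the diff relies on, here is a minimal, self-contained sketch. It uses simplified stand-in types, not the real org.apache.spark.sql.types classes: the names CollationSketch and ArrayType, and the string-valued collation field, are assumptions made for the example. It only shows why reference equality (eq) separates the default STRING from an explicit STRING COLLATE UTF8_BINARY, and why replacing with UTF8_BINARY leaves collations accurate.

```scala
object CollationSketch {
  // Simplified stand-ins for illustration only; NOT Spark's real type classes.
  sealed trait DataType

  class StringType(val collation: String) extends DataType {
    // Value equality compares collations, so the default singleton and an
    // explicit UTF8_BINARY instance are == to each other but not eq.
    override def equals(other: Any): Boolean = other match {
      case s: StringType => s.collation == collation
      case _ => false
    }
    override def hashCode: Int = collation.hashCode
    override def toString: String = s"StringType($collation)"
  }

  // Singleton standing in for STRING without an explicit collation.
  object StringType extends StringType("UTF8_BINARY") {
    // Stands in for STRING COLLATE <name>: always a fresh instance.
    def apply(collation: String): StringType = new StringType(collation)
  }

  case class ArrayType(elementType: DataType) extends DataType
  case object IntegerType extends DataType

  // Only the singleton counts as the default string type (reference equality).
  def isDefaultStringType(dt: DataType): Boolean = dt match {
    case st: StringType => st.eq(StringType)
    case _ => false
  }

  // Replace default strings with newType; explicit collations stay untouched.
  def replaceDefaultStringType(dt: DataType, newType: StringType): DataType =
    dt match {
      case st: StringType if isDefaultStringType(st) => newType
      case ArrayType(elem) => ArrayType(replaceDefaultStringType(elem, newType))
      case other => other
    }
}
```

   In this sketch, replacing with StringType("UTF8_BINARY") yields a type whose collation is the same as before, which mirrors the argument above that skipping or replacing default strings in that case does not change the resulting collation.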



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

