Re: [PR] [SPARK-50206][SQL] Added separate collation id for UTF8_BINARY and non-collated strings [spark]

via GitHub Thu, 07 Nov 2024 16:27:21 -0800


cloud-fan commented on code in PR #48737:
URL: https://github.com/apache/spark/pull/48737#discussion_r1833535951



##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -83,9 +83,18 @@ class StringType private (val collationId: Int) extends 
AtomicType with Serializ
   override def jsonValue: JValue = JString("string")
 
   override def equals(obj: Any): Boolean =

Review Comment:
   StirngType with un-determined collation is and should be different from 
StringType with udf8 collation. It's very dangerous to treat them the same at 
such a low level.
   
   Let's list the places that should treat them the same:
   1. Spark internal type coercion rules
   2. Spark internal collation-aware functions/operators
   3. external code that needs to deal with string type (UDF, data source)
   
   For 1 and 2, I think most code goes to `CollationFactory` and we don't need 
to update a lot of places. For 3, there won't be any problem by default as 
collation is not enabled. The existing code still works as all `StringType` has 
un-determined collation. If users turn on collation, the existing code is 
broken anyway as they need to handle different collations.
   
   Eventually, when collation is enabled, we should not have un-determined 
collation and all the string types should have a collation determined by the 
context (session default, or table/schema/catalog default).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-50206][SQL] Added separate collation id for UTF8_BINARY and non-collated strings [spark]

Reply via email to