cloud-fan commented on code in PR #48737:
URL: https://github.com/apache/spark/pull/48737#discussion_r1833535951
##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -83,9 +83,18 @@ class StringType private (val collationId: Int) extends
AtomicType with Serializ
override def jsonValue: JValue = JString("string")
override def equals(obj: Any): Boolean =
Review Comment:
StirngType with un-determined collation is and should be different from
StringType with udf8 collation. It's very dangerous to treat them the same at
such a low level.
Let's list the places that should treat them the same:
1. Spark internal type coercion rules
2. Spark internal collation-aware functions/operators
3. external code that needs to deal with string type (UDF, data source)
For 1 and 2, I think most code goes to `CollationFactory` and we don't need
to update a lot of places. For 3, there won't be any problem by default as
collation is not enabled. The existing code still works as all `StringType` has
un-determined collation. If users turn on collation, the existing code is
broken anyway as they need to handle different collations.
Eventually, when collation is enabled, we should not have un-determined
collation and all the string types should have a collation determined by the
context (session default, or table/schema/catalog default).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]