Re: [PR] [SPARK-46831][SQL] Collations - Extending StringType and PhysicalStringType with collationId field [spark]

via GitHub Mon, 29 Jan 2024 00:56:30 -0800


dbatomic commented on code in PR #44901:
URL: https://github.com/apache/spark/pull/44901#discussion_r1469260713



##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -23,9 +23,10 @@ import org.apache.spark.annotation.Stable
  * The data type representing `String` values. Please use the singleton `DataTypes.StringType`.
  *
  * @since 1.3.0
+ * @param collationId The id of collation for this StringType.
  */
 @Stable
-class StringType private() extends AtomicType {
+class StringType private(val collationId: Int) extends AtomicType {

Review Comment:
   Sure. The overall design is captured in the design doc attached to the JIRA 
ticket, but let me write out the reasoning here as well.
   
   The reasons are as follows:
   1) `collationId` will be serializable. Once we get to the point of marking a 
column with a collation, that information will need to be persisted.
   2) In the future there will be thousands of possible collation combinations 
(all locales (800+) x case sensitivity x accent sensitivity x trimming).
   3) We could go with an enum, but I think enums are not well suited to such 
large collections.
   4) This will have to work with Photon as well, or with any other engine. A 
simple integer that points to the collation rules looks like a simple 
implementation that can be easily mimicked in other engines.
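
To make points 2) and 4) concrete, here is a rough sketch of how a single integer could cover the locale x case x accent x trimming combinations. The bit layout and the names `CollationIdSketch`, `CollationSpec`, `encode` and `decode` are hypothetical and exist only for illustration; this is not the encoding used by this PR or by Spark.

```scala
// Hypothetical sketch: NOT the actual collationId layout in Spark.
object CollationIdSketch {
  // Assumed bit layout (illustrative only):
  //   bits 0-11 : locale index (4096 slots, enough for 800+ locales)
  //   bit  12   : case-insensitive flag
  //   bit  13   : accent-insensitive flag
  //   bit  14   : trimming flag
  private val LocaleBits = 12
  private val LocaleMask = (1 << LocaleBits) - 1
  private val CaseBit = 1 << 12
  private val AccentBit = 1 << 13
  private val TrimBit = 1 << 14

  // Human-readable description of one collation combination.
  final case class CollationSpec(
      localeIndex: Int,
      caseInsensitive: Boolean,
      accentInsensitive: Boolean,
      trim: Boolean)

  // Pack a spec into a plain Int that is trivial to persist
  // or to hand to another engine.
  def encode(spec: CollationSpec): Int = {
    require(
      spec.localeIndex >= 0 && spec.localeIndex <= LocaleMask,
      "locale index out of range")
    val caseFlag = if (spec.caseInsensitive) CaseBit else 0
    val accentFlag = if (spec.accentInsensitive) AccentBit else 0
    val trimFlag = if (spec.trim) TrimBit else 0
    spec.localeIndex | caseFlag | accentFlag | trimFlag
  }

  // Recover the spec from the Int using only masks.
  def decode(id: Int): CollationSpec =
    CollationSpec(
      localeIndex = id & LocaleMask,
      caseInsensitive = (id & CaseBit) != 0,
      accentInsensitive = (id & AccentBit) != 0,
      trim = (id & TrimBit) != 0)
}
```

Under this assumed layout, any engine that knows the masks can interpret the id without sharing a Scala enum definition, and the same integer is what would be written out when a column's collation is persisted.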
   
   Of course, this is just my reasoning. I would appreciate your thoughts on 
this.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

