dbatomic commented on code in PR #44901:
URL: https://github.com/apache/spark/pull/44901#discussion_r1469260713
##########
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##########
@@ -23,9 +23,10 @@ import org.apache.spark.annotation.Stable
* The data type representing `String` values. Please use the singleton
`DataTypes.StringType`.
*
* @since 1.3.0
+ * @param collationId The id of collation for this StringType.
*/
@Stable
-class StringType private() extends AtomicType {
+class StringType private(val collationId: Int) extends AtomicType {
Review Comment:
Sure. The overall design is captured in the design doc attached to the JIRA
ticket, but let me write the reasoning here as well.
The reasons are as follows:
1) The collation ID will be serializable. Once we get to the point of marking
a column with a collation, that information will need to be persisted.
2) In the future there will be thousands of possible collation combinations
(all locales (800+) x case sensitivity x accent sensitivity x trimming).
3) We could go with an enum, but I think enums are not well suited to
such large collections.
4) This will have to work with Photon as well, or with any other engine. A
simple integer that points to the collation rules is a simple
implementation that can be easily mimicked in other engines.
Of course, this is just my reasoning. I would appreciate your thoughts on
this.
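
To make the idea concrete, here is a minimal sketch of an integer ID that points to collation rules. It is an illustration only, not the code in this PR: `CollationSpec`, `CollationRegistry`, `CollatedStringType`, the two table entries, and the assumption that ID 0 is the default binary collation are all hypothetical names and values chosen for the example.

```scala
// Hypothetical sketch: none of these names are part of Spark's API.

// The rules that a collation ID stands for.
final case class CollationSpec(
    locale: String,
    caseSensitive: Boolean,
    accentSensitive: Boolean)

object CollationRegistry {
  // Table indexed by collation ID. ID 0 is assumed to be the default
  // binary collation; the second entry is made up for illustration.
  private val table: Vector[CollationSpec] = Vector(
    CollationSpec("binary", caseSensitive = true, accentSensitive = true),
    CollationSpec("en_US", caseSensitive = false, accentSensitive = true)
  )

  // Resolve an ID to its rules; unknown IDs yield None.
  def lookup(id: Int): Option[CollationSpec] = table.lift(id)

  // Reverse lookup: find the ID registered for a given set of rules.
  def idOf(spec: CollationSpec): Option[Int] = {
    val i = table.indexOf(spec)
    if (i >= 0) Some(i) else None
  }
}

// A type that carries only the integer, so persisting it or handing it to
// another engine means writing a single Int.
final case class CollatedStringType(collationId: Int) {
  def spec: Option[CollationSpec] = CollationRegistry.lookup(collationId)
}
```

The type itself stores nothing but the `Int`, so any engine that agrees on the table can resolve the same rules without sharing a Scala enum.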
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]