Re: [PR] [SPARK-50206][SQL] Added separate collation id for UTF8_BINARY and non-collated strings [spark]

via GitHub Mon, 11 Nov 2024 00:27:08 -0800


cloud-fan commented on PR #48737:
URL: https://github.com/apache/spark/pull/48737#issuecomment-2467520611


   Since we can't reach an agreement, maybe we should pick a different 
approach. A StringType with the default/undetermined collation should behave 
the same as the utf8 collation so that we won't break anything, but it should 
also carry a special annotation so that we can determine the actual default 
collation later on.
   
   With the above requirement in mind, I think we should keep `object 
StringType` unchanged so that it's guaranteed that we won't break anything if 
users do not use string collation. A StringType whose utf8 collation was given 
explicitly should instead carry the special annotation, so that we don't 
change it afterward.
   
   My new proposal is: in the parser, we return `object StringType` if the 
collation is not explicitly given, and return `new StringType(...)` when the 
collation is explicitly given. Later on, when we need to assign the actual 
default collation, we should look for string types where 
`stringType.eq(StringType)` is true, instead of checking 
`stringType.collationId == 0`.
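   
   A minimal sketch of the idea, assuming a simplified stand-in class rather 
than Spark's real `StringType`: the names `SketchStringType`, `ParserSketch`, 
`parseStringType` and `resolveDefault` are hypothetical and only illustrate 
how reference equality could separate "collation not given" from "utf8 given 
explicitly", even though both have collation id 0.

```scala
// Simplified stand-in for a collated string type. This is NOT Spark's class;
// it only models a collation id plus value-based equality.
class SketchStringType(val collationId: Int) {
  override def equals(other: Any): Boolean = other match {
    case s: SketchStringType => s.collationId == collationId
    case _ => false
  }
  override def hashCode(): Int = collationId.hashCode
}

// Singleton playing the role of `object StringType`: collation not given.
object SketchStringType extends SketchStringType(0)

object ParserSketch {
  // Hypothetical parser hook: None means the user wrote no collation.
  def parseStringType(explicitCollationId: Option[Int]): SketchStringType =
    explicitCollationId match {
      case None     => SketchStringType           // the shared singleton
      case Some(id) => new SketchStringType(id)   // a fresh instance, even for id 0
    }

  // Hypothetical later pass: only the singleton receives the resolved default.
  // The check is reference equality (`eq`), not `collationId == 0`.
  def resolveDefault(t: SketchStringType, defaultId: Int): SketchStringType =
    if (t.eq(SketchStringType)) new SketchStringType(defaultId) else t
}
```

   With this shape, `parseStringType(None)` and `parseStringType(Some(0))` 
compare equal with `==` but differ under `eq`, so `resolveDefault` rewrites 
only the former.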


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


