dejankrak-db commented on code in PR #49772:
URL: https://github.com/apache/spark/pull/49772#discussion_r1953022318
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDDLCommandStringTypes.scala:
##########
@@ -155,22 +123,22 @@ object ResolveDefaultStringTypes extends Rule[LogicalPlan] {
 dataType.existsRecursively(isDefaultStringType)
 private def isDefaultStringType(dataType: DataType): Boolean = {
+ // STRING (without explicit collation) is considered default string type.
+ // STRING COLLATE <collation_name> (with explicit collation) is not considered
+ // default string type even when explicit collation is UTF8_BINARY (default collation).
 dataType match {
- case st: StringType =>
- // should only return true for StringType object and not StringType("UTF8_BINARY")
- st.eq(StringType) || st.isInstanceOf[TemporaryStringType]
+ // should only return true for StringType object and not for StringType("UTF8_BINARY")
+ case st: StringType => st.eq(StringType)
Review Comment:
I changed the line above to `case StringType => true`, as suggested. This
caused the test "create/alter table with table level collation" in
DefaultCollationTestSuite to fail at line 88, on the following check:

assertTableColumnCollation(testTable, "c3", "UTF8_BINARY")

The check expects the data type of column c3 to be StringType(UTF8_BINARY), or
simply StringType, as set explicitly by the `c3 STRING COLLATE UTF8_BINARY`
clause. The actual collation was UTF8_LCASE, because the column type had been
replaced by the specified table level collation.

The cause is the difference between the two patterns. With
`case st: StringType => st.eq(StringType)`, the match returns false for
StringType(UTF8_BINARY), presumably because `eq` is a reference check and the
StringType object and a StringType(UTF8_BINARY) instance are distinct
references. With `case StringType => true`, the match returns true for
StringType(UTF8_BINARY), presumably because that pattern compares by equality
rather than by reference. The column's string type is then replaced by the
table level collation type StringType(UTF8_LCASE), which is incorrect behavior
per the ref spec.

Taking a step back, I feel that this is a relatively minor detail. Provided
that the implementation works correctly per the ref spec (including edge
cases), we should not block on this. Let me know what you think, thanks!
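
To illustrate the difference between the two patterns described above, here is a minimal, self-contained Scala sketch. It does not use Spark: `StrType`, `DefaultStringCheck`, `isDefaultByRef` and `isDefaultByPattern` are hypothetical names, and the value-based `equals` is an assumption about how the real StringType compares instances, not a copy of its implementation.

```scala
// Hypothetical stand-in for a type with value-based equality (by collation
// name). This is NOT Spark's StringType; it only mimics the assumed shape.
class StrType(val collation: String) {
  override def equals(other: Any): Boolean = other match {
    case s: StrType => s.collation == collation
    case _          => false
  }
  override def hashCode(): Int = collation.hashCode
}

// Singleton that plays the role of "STRING without an explicit collation".
object StrType extends StrType("UTF8_BINARY") {
  def apply(collation: String): StrType = new StrType(collation)
}

object DefaultStringCheck {
  // Type pattern plus eq: true only when the value IS the singleton object.
  def isDefaultByRef(t: Any): Boolean = t match {
    case st: StrType => st.eq(StrType)
    case _           => false
  }

  // Stable identifier pattern: Scala tests it with ==, so every instance
  // that is equal to the singleton matches, not just the singleton itself.
  def isDefaultByPattern(t: Any): Boolean = t match {
    case StrType => true
    case _       => false
  }

  def main(args: Array[String]): Unit = {
    // Distinct instance with the same collation as the singleton.
    val explicit = StrType("UTF8_BINARY")
    println(s"by reference, singleton: ${isDefaultByRef(StrType)}")
    println(s"by reference, explicit:  ${isDefaultByRef(explicit)}")
    println(s"by pattern,   explicit:  ${isDefaultByPattern(explicit)}")
  }
}
```

Under these assumptions, only the `eq` variant tells the singleton apart from an explicitly constructed instance with the same collation, which is the distinction the "c3 STRING COLLATE UTF8_BINARY" case relies on.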