nikolamand-db commented on code in PR #47503:
URL: https://github.com/apache/spark/pull/47503#discussion_r1695414449


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##########
@@ -2319,6 +2319,30 @@ class CollationSQLExpressionsSuite
     )
   }
 
+  // scalastyle:off nonascii
+  test("approx_count_distinct returns correct with collation") {
+    Seq(
+      ("utf8_lcase", Seq("a", "a", "A"), Seq(Row(1))),
+      ("utf8_lcase", Seq("aCfew", "acFEw", "ACFEW"), Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("č", "c", "Ć", "C", "Z", "Ž"), Seq(Row(2))),
+      ("UNICODE_CI_AI", Seq("fAFfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("fAcfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(2)))

Review Comment:
   Please add some more edge-case checks. Examples of such scenarios:
   
   - single string
   - repeat same string several times
   - repetitive combinations of lowercase/uppercase versions of the same letter 
(e.g. `aaa, aAa, Aaa, AAA, ...`)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala:
##########
@@ -93,6 +94,8 @@ class HyperLogLogPlusPlusHelper(relativeSD: Double) extends 
Serializable {
     val value = dataType match {
       case FloatType => FLOAT_NORMALIZER.apply(_value)
       case DoubleType => DOUBLE_NORMALIZER.apply(_value)
+      case _: StringType => CollationFactory.getCollationKeyBytes(
+        _value.asInstanceOf[UTF8String], 
dataType.asInstanceOf[StringType].collationId)

Review Comment:
   Let's try to avoid `asInstanceOf`. For this particular case, we can use 
`case s: StringType` and extract the collation information directly from 
`s`.



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##########
@@ -2319,6 +2319,30 @@ class CollationSQLExpressionsSuite
     )
   }
 
+  // scalastyle:off nonascii
+  test("approx_count_distinct returns correct with collation") {
+    Seq(
+      ("utf8_lcase", Seq("a", "a", "A"), Seq(Row(1))),
+      ("utf8_lcase", Seq("aCfew", "acFEw", "ACFEW"), Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("č", "c", "Ć", "C", "Z", "Ž"), Seq(Row(2))),
+      ("UNICODE_CI_AI", Seq("fAFfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("fAcfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(2)))
+    ).foreach{

Review Comment:
   ```suggestion
       ).foreach {
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##########
@@ -2319,6 +2319,30 @@ class CollationSQLExpressionsSuite
     )
   }
 
+  // scalastyle:off nonascii
+  test("approx_count_distinct returns correct with collation") {
+    Seq(
+      ("utf8_lcase", Seq("a", "a", "A"), Seq(Row(1))),
+      ("utf8_lcase", Seq("aCfew", "acFEw", "ACFEW"), Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("č", "c", "Ć", "C", "Z", "Ž"), Seq(Row(2))),
+      ("UNICODE_CI_AI", Seq("fAFfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("fAcfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(2)))
+    ).foreach{
+      case (collationName: String, input: Seq[String], expected: Seq[Row]) =>
+        checkAnswer(sql(
+          s"""
+             with t as (
+             select collate(col1, '$collationName') as c
+             from
+             values ${input.map(s => s"('$s')").mkString(", ")}
+          )
+          SELECT approx_count_distinct(c) FROM t
+             """), expected)
+    }

Review Comment:
   Let's use the following pattern:
   
   1. Create a `case class` that encapsulates the specific check we're going to 
make
   2. Construct sequence of instances of the previously created case class
   3. Iterate the sequence and perform required checks; this is already in 
place, but I suggest we use `CREATE TABLE` + `INSERT` statements instead of 
`WITH` to make the code more readable
   4. Change the session-level default collation and perform the checks again to 
verify `approx_count_distinct` works well when a non-`UTF8_BINARY` collation is 
the session default
   
   Example test which already follows this approach: 
https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala#L42-L70



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##########
@@ -2319,6 +2319,30 @@ class CollationSQLExpressionsSuite
     )
   }
 
+  // scalastyle:off nonascii
+  test("approx_count_distinct returns correct with collation") {
+    Seq(
+      ("utf8_lcase", Seq("a", "a", "A"), Seq(Row(1))),
+      ("utf8_lcase", Seq("aCfew", "acFEw", "ACFEW"), Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("č", "c", "Ć", "C", "Z", "Ž"), Seq(Row(2))),
+      ("UNICODE_CI_AI", Seq("fAFfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(1))),
+      ("UNICODE_CI_AI", Seq("fAcfewfćČĆfečwćCsŠ", "fAffEWfćČCfeCwcćss"), 
Seq(Row(2)))
+    ).foreach{
+      case (collationName: String, input: Seq[String], expected: Seq[Row]) =>
+        checkAnswer(sql(
+          s"""
+             with t as (
+             select collate(col1, '$collationName') as c
+             from
+             values ${input.map(s => s"('$s')").mkString(", ")}
+          )
+          SELECT approx_count_distinct(c) FROM t
+             """), expected)
+    }
+

Review Comment:
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to