Re: [PR] [SPARK-52828][SQL] Make hashing for collated strings collation agnostic [spark]

via GitHub Sat, 19 Jul 2025 01:01:58 -0700


uros-db commented on code in PR #51521:
URL: https://github.com/apache/spark/pull/51521#discussion_r2217228239



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala:
##########
@@ -545,11 +580,32 @@ abstract class InterpretedHashFunction {
 
   protected def hashUnsafeBytes(base: AnyRef, offset: Long, length: Int, seed: Long): Long
 
+  private lazy val legacyCollationAwareHashing: Boolean =
+    SQLConf.get.getConf(SQLConf.COLLATION_AWARE_HASHING_ENABLED)
+
   /**
-   * Computes hash of a given `value` of type `dataType`. The caller needs to check the validity
-   * of input `value`.
+   * This method is intended for callers using the old hash API and preserves compatibility for
+   * supported data types. It must only be used for data types that do not include collated strings
+   * or complex types (e.g., structs, arrays, maps) that may contain collated strings.
+   *
+   * The caller is responsible for ensuring that `dataType` does not involve collation-aware fields.
+   * This is validated via an internal assertion.
+   *
+   * @throws IllegalArgumentException if `dataType` contains non-UTF8 binary collation.
    */
   def hash(value: Any, dataType: DataType, seed: Long): Long = {
+    require(!SchemaUtils.hasNonUTF8BinaryCollation(dataType))
+    // For UTF8_BINARY, hashing behavior is the same regardless of the isCollationAware flag.
+    hash(value = value, dataType = dataType, seed = seed, isCollationAware = false)
+  }

Review Comment:
   Thank you Milan!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

