Nathan Holland created SPARK-45900:
--------------------------------------
Summary: Update hash functionalities from xxHash64 to XXH3
Key: SPARK-45900
URL: https://issues.apache.org/jira/browse/SPARK-45900
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.0
Reporter: Nathan Holland
I often work on projects that require deterministic randomness, especially when
creating surrogate keys. For small volumes of data xxhash64 works well, but it
does not scale: with a 64-bit hash code, the chance of at least one collision is
already about one in a million when you hash just six million items, and it
rises sharply from there because of the birthday paradox.
Currently there are a few ways to handle this:
- hash: 32-bit output (>50% chance of at least one collision for tables larger
than 77000 rows, and likely ~1.6 billion collisions in a table of roughly 4.3
billion rows)
- xxhash64: 64-bit output (>50% chance of at least one collision for tables
larger than about 5 billion rows)
- shaXXX/md5: single binary column input, string output, quite computationally
expensive.
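The collision figures above can be sanity-checked with the standard birthday
approximation. The following is a minimal sketch in plain Python, not Spark
code; the function names are made up for illustration, and it assumes the hash
output is uniformly distributed.

```python
import math


def collision_probability(rows: int, bits: int) -> float:
    """Approximate P(at least one collision) among `rows` uniform hashes.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2^bits)).
    """
    buckets = 2.0 ** bits
    return -math.expm1(-rows * (rows - 1) / (2.0 * buckets))


def expected_colliding_rows(rows: int, bits: int) -> float:
    """Expected number of rows whose hash repeats an earlier row's hash.

    Computed as n minus the expected number of distinct hash values,
    m * (1 - (1 - 1/m)^n), with m = 2^bits.
    """
    buckets = 2.0 ** bits
    distinct = -buckets * math.expm1(rows * math.log1p(-1.0 / buckets))
    return rows - distinct


# Figures discussed in this ticket (assumes a uniform hash).
print(collision_probability(6_000_000, 64))      # six million rows, 64-bit
print(collision_probability(77_000, 32))         # 77000 rows, 32-bit
print(collision_probability(5_000_000_000, 64))  # 5 billion rows, 64-bit
print(expected_colliding_rows(2**32, 32))        # 2^32 rows, 32-bit
```

Under this approximation the expected number of colliding rows for the 32-bit
hash is roughly 1.4 billion at exactly 4 billion rows and roughly 1.6 billion
at 2^32 (about 4.3 billion) rows.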
I'd suggest adding the newest algorithm in the xxHash family, XXH3. The XXH3
family is a modern 64-bit and 128-bit hash function family that provides
improved strength and performance across the board.
I'd imagine this would be a new function named xxhash3 that accepts a bit
length of either 64 or 128. For usability, I believe the bit length should
default to 128 bits, which gives the best protection against accidental
collisions, and users could override it to 64 when they need additional
performance or interoperability. (Given the benchmarks, this would likely be
quite rare.)
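To give a rough sense of the headroom a 128-bit default would provide, the
birthday approximation can be inverted to ask how many rows can be hashed
before the collision probability reaches a chosen level. This is only a sketch
in plain Python with a hypothetical helper name; it assumes a uniformly
distributed hash and says nothing about how Spark would implement xxhash3.

```python
import math


def rows_for_probability(bits: int, probability: float) -> float:
    """Approximate row count at which P(at least one collision) hits `probability`.

    Inverts the birthday approximation p = 1 - exp(-n^2 / (2 * 2^bits)),
    giving n = sqrt(2 * 2^bits * ln(1 / (1 - p))).
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    buckets = 2.0 ** bits
    return math.sqrt(2.0 * buckets * -math.log1p(-probability))


# Row counts for a one-in-a-million and a 50% collision chance.
for bits in (32, 64, 128):
    one_in_a_million = rows_for_probability(bits, 1e-6)
    fifty_percent = rows_for_probability(bits, 0.5)
    print(f"{bits:>3}-bit: 1e-6 at ~{one_in_a_million:.3g} rows, "
          f"50% at ~{fifty_percent:.3g} rows")
```

For 64 bits the one-in-a-million point lands near six million rows, matching
the figure quoted earlier; for 128 bits it is on the order of 10^16 rows.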
References:
* [Documentation|https://xxhash.com/]
* xxHash64 Ticket
* [Existing xxHash64 logic|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java]