[
https://issues.apache.org/jira/browse/SPARK-45900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nathan Holland updated SPARK-45900:
-----------------------------------
Summary: Expand hash functionalities to include XXH3 (was: Update
hash functionalities from xxHash64 to XXH3)
> Expand hash functionalities to include XXH3
> -------------------------------------------
>
> Key: SPARK-45900
> URL: https://issues.apache.org/jira/browse/SPARK-45900
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Nathan Holland
> Priority: Major
>
> I often work in projects that require deterministic randomness, especially
> when creating surrogate keys. For small volumes of data xxhash64 works well,
> but it does not scale: with a 64-bit hash code, the chance of a collision is
> already about one in a million when you hash just six million items, and it
> rises sharply from there because of the birthday paradox.
> Currently there are a few ways to handle this:
> - hash: 32-bit output (>50% chance of at least one collision for tables
> larger than 77000 rows, and likely ~1.6 billion collisions in a table of size
> 4 billion)
> - xxhash64: 64-bit output (>50% chance of at least one collision for tables
> larger than 5 billion rows)
> - shaXXX/md5: single binary column input, string output, quite
> computationally expensive.
> I'd suggest adding the newest algorithm in the xxHash family, XXH3. XXH3 is
> a modern family of 64-bit and 128-bit hash functions that provides improved
> strength and performance across the board.
> I'd imagine this would be a new function named xxhash3 that accepts a bit
> length of 64 or 128. For usability I believe the bit length should default to
> 128 to minimize accidental collisions, leaving users to override it to 64 if
> they need to for additional performance or interop reasons (given the
> benchmarks, this would likely be quite rare).
> References:
> * [Documentation|https://xxhash.com/]
> * xxHash64 Ticket
> * [Existing xxHash64
> logic|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java]
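
The collision figures quoted in the description can be sanity-checked with the
standard birthday approximation. The sketch below is not part of the ticket; it
assumes hash outputs are uniformly distributed, and the function names
(collision_probability, rows_for_half_chance, expected_collisions) are
hypothetical helpers introduced only for illustration.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Approximate P(at least one collision) among n_items uniform hashes."""
    space = 2.0 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when p is very small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


def rows_for_half_chance(bits: int) -> float:
    """Approximate row count at which P(at least one collision) reaches 50%."""
    return math.sqrt(2.0 * math.log(2.0) * 2.0 ** bits)


def expected_collisions(n_items: int, bits: int) -> float:
    """Expected number of rows whose hash repeats an earlier row's hash.

    Computed as n minus the expected number of distinct hash values.
    Only meaningful when n_items is comparable to 2**bits; for tiny
    ratios the subtraction loses floating-point precision.
    """
    space = 2.0 ** bits
    expected_distinct = space * -math.expm1(-n_items / space)
    return n_items - expected_distinct


if __name__ == "__main__":
    # Six million items at 64 and 128 bits.
    for bits in (64, 128):
        print(bits, collision_probability(6_000_000, bits))
    # 50% thresholds for 32-bit and 64-bit outputs.
    for bits in (32, 64):
        print(bits, round(rows_for_half_chance(bits)))
    # Expected collisions when a 32-bit hash is applied to 2**32 rows.
    print(round(expected_collisions(2 ** 32, 32)))
```

Under these assumptions the 64-bit result for six million items comes out close
to one in a million, and the 50% thresholds land near 77 thousand rows (32-bit)
and 5 billion rows (64-bit), in line with the description. At 128 bits the same
six million items give a probability many orders of magnitude smaller, which is
the motivation for the proposed default.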