Hi,

I’m working on something that requires deterministic randomness, i.e. each row 
gets the same “random” value no matter the order of the DataFrame. A seeded 
hash seems to be the perfect way to do this, but the existing hash functions 
have various limitations:

- hash: 32-bit output (only 2^32 ≈ 4.3 billion possible values, which will 
result in a lot of collisions for many tables: the birthday paradox implies a 
>50% chance of at least one collision for tables larger than ~77,000 rows, and 
roughly 1.6 billion expected collisions in a table of 2^32 rows; see the 
sketch after this list)
- sha1/sha2/md5: take only a single binary column as input, and produce a 
string output
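
For anyone who wants to check those numbers, here’s a minimal sketch of the 
estimates (plain Scala, no Spark; these are the standard birthday-problem 
approximations):

    object CollisionEstimates {
      val space: Double = math.pow(2, 32) // size of a 32-bit hash's output space

      // Rows needed for a >50% chance of at least one collision:
      // sqrt(2 * N * ln 2) for an output space of size N.
      val fiftyPercentRows: Double = math.sqrt(2 * space * math.log(2))

      // Expected collisions among m rows: m - N * (1 - exp(-m / N)),
      // i.e. m minus the expected number of distinct hash values.
      def expectedCollisions(m: Double): Double =
        m - space * (1 - math.exp(-m / space))

      def main(args: Array[String]): Unit = {
        println(f"50%% threshold: ~$fiftyPercentRows%.0f rows")                      // ~77163
        println(f"expected collisions at 2^32 rows: ~${expectedCollisions(space)}%.3g") // ~1.58e9
      }
    }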

It seems there’s already support for a 64-bit hash function that can work with 
an arbitrary number of arbitrary-typed columns (XxHash64), and exposing this 
for DataFrames seems to be essentially one line in sql/functions.scala to 
match `hash` (plus docs, tests, a function registry entry, etc.):

    def hash64(cols: Column*): Column = withExpr {
      new XxHash64(cols.map(_.expr))
    }
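
The SQL-side registration would be similarly small, assuming it mirrors the 
existing `hash` entry in FunctionRegistry.scala:

    // in FunctionRegistry.scala, alongside the existing
    // expression[Murmur3Hash]("hash") entry:
    expression[XxHash64]("hash64"),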

For my use case, this can then be used to get a 64-bit “random” column like

    val seed = rng.nextLong()
    hash64(lit(seed), col1, col2)
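
Spelled out as a self-contained sketch (the `rng` above is just any seeded 
RNG; this assumes the proposed hash64 is available):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    val seed = new scala.util.Random().nextLong()

    // The value depends only on (seed, key, value), so it is stable
    // across shuffles, repartitioning, and re-ordering of the rows.
    // (hash64 is the proposed function, not yet in Spark.)
    val withRand = df.withColumn("rand64", hash64(lit(seed), $"key", $"value"))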

I’ve created a (hopefully) complete patch by mimicking `hash` at 
https://github.com/apache/spark/compare/master...huonw:hash64; should I open a 
JIRA and submit it as a pull request?

Additionally, both `hash` and the new `hash64` already support seeding, but 
this isn’t exposed directly and instead requires something like the `lit` 
above. Would it make sense to add overloads like the following?

    def hash(seed: Int, cols: Column*): Column = …
    def hash64(seed: Long, cols: Column*): Column = …

Though it does seem a bit unfortunate to be forced to pass the seed first 
(unavoidable with overloads like these, since the varargs parameter must come 
last).
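
For concreteness, here’s a sketch of what the bodies might look like, using 
the seeded constructors that Murmur3Hash and XxHash64 already have (exact 
placement and naming are just a suggestion):

    def hash(seed: Int, cols: Column*): Column = withExpr {
      // Murmur3Hash(children, seed) is the seeded primary constructor
      new Murmur3Hash(cols.map(_.expr), seed)
    }

    def hash64(seed: Long, cols: Column*): Column = withExpr {
      new XxHash64(cols.map(_.expr), seed)
    }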

(I sent this email to u...@spark.apache.org a few days ago, but didn't get any 
discussion about the Spark aspects of this, so I'm resending it here; I 
apologise in advance if I'm breaking protocol!)

- Huon Wilson

