+1 to documenting it; a seed argument would be great if possible
________________________________
From: Sean Owen <sro...@gmail.com>
Sent: Monday, September 26, 2022 5:26:26 PM
To: Nicholas Gustafson <njgustaf...@gmail.com>
Cc: dev <dev@spark.apache.org>
Subject: Re: Why are hash functions seeded with 42?

Oh yeah, I get why we love to pick 42 for random things. I'm guessing it was a bit of an 
oversight here, since the 'seed' is directly the initial state, and 0 makes 
much more sense.

On Mon, Sep 26, 2022, 7:24 PM Nicholas Gustafson 
<njgustaf...@gmail.com> wrote:
I don’t know the reason; however, I would offer a hunch that perhaps it’s a nod to 
Douglas Adams (author of The Hitchhiker’s Guide to the Galaxy).

https://news.mit.edu/2019/answer-life-universe-and-everything-sum-three-cubes-mathematics-0910

On Sep 26, 2022, at 16:59, Sean Owen 
<sro...@gmail.com> wrote:


OK, it came to my attention today that hash functions in Spark, like xxhash64, 
actually always seed with 42: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L655

This is an issue if you want the hash of some value in Spark to match the hash 
you compute with xxhash64 somewhere else, and, AFAICT, just about any other 
implementation will start with seed=0.
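
As a rough illustration, a minimal PySpark sketch of the mismatch, assuming the 
third-party xxhash Python package, a local SparkSession, and that for a single 
string column Spark applies XXH64 to the string's UTF-8 bytes:

import xxhash
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

value = "Spark"
# Spark's built-in xxhash64 (seeded with 42) returns a signed 64-bit long.
spark_hash = spark.range(1).select(F.xxhash64(F.lit(value)).alias("h")).first()["h"]

def to_signed_64(u):
    # The xxhash package returns an unsigned integer; convert for comparison.
    return u - (1 << 64) if u >= (1 << 63) else u

external_42 = to_signed_64(xxhash.xxh64(value.encode("utf-8"), seed=42).intdigest())
external_0 = to_signed_64(xxhash.xxh64(value.encode("utf-8"), seed=0).intdigest())

print(spark_hash == external_42)  # expected True: matches only when seeded with 42
print(spark_hash == external_0)   # expected False: the usual seed=0 default does not match

In other words, to reproduce Spark's value elsewhere you would have to pass 42 as 
the seed explicitly.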

I'm guessing there wasn't a great reason for this; it just seemed like 42 was a 
nice default seed. And we can't change it now without possibly subtly changing 
program behaviors. And I'm guessing it's messy to let the function take a 
seed argument now, especially in SQL.

So I'm left with: I guess we should document that? I can do it if so.
And it's just a cautionary tale, I guess, for hash function users.
