[Python-ideas] Re: set arbitrary hash random seed to ensure reproducible results

Hao Hu Sat, 18 Dec 2021 13:00:55 -0800

On 12/18/21 08:44, Stephen J. Turnbull wrote:

Hao Hu writes:
  > > On 17 Dec 2021, at 15:28, Chris Angelico <ros...@gmail.com> wrote:


  > > The built-in hash() function is extremely generic, so it can't really
  > > work that way. Adding a parameter to it would require (a) adding the
  > > parameter to every __hash__ method of every object, including
  > > user-defined objects;
  >
  > I would not say the opposite; however, maybe it appears to be more
  > complicated than it really is. Probably it is worth a small
  > analysis?

It's the user-defined objects that are the killer here.  We don't want
to go wrecking dozens of projects' objects.

  > >> For instance, if we create a caching programming interface that
  > >> relies on a distributed kv store,

I would be very suspicious of using Python's hash builtin for such a
purpose.  The Python hash functions are very carefully tuned for high
performance in one application only: equality testing in Python,
especially for dicts.  Many __hash__ methods omit much of the object
being hashed; if the variation in your keys depends only on those
attributes, you'll get a lot of collisions.  Others are extremely
predictable.  E.g., most integers and other numbers equal to integers
hash to themselves mod 2**61 - 1; I believe -1 is the only exception.
Being predictable as such may not be a problem for your kv store
cache, but predictable == pattern, and if your application happens to
match that pattern, you could again end up with a massive collision
problem.  I imagine this is much less likely to be a problem than the
case where keys depend on omitted attributes, since presumably the
__hash__ method is designed to cover the whole range.  And numbers are
the only case I know of offhand.
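
The predictability described above can be checked with a short sketch. It assumes CPython and reads the modulus from sys.hash_info rather than hard-coding 2**61 - 1, since the value differs on 32-bit builds. The variable names are illustrative only.

```python
import sys

# Modulus CPython uses for numeric hashes (2**61 - 1 on typical 64-bit builds).
P = sys.hash_info.modulus

small_int_hash = hash(12345)    # small non-negative ints hash to themselves
float_hash = hash(12345.0)      # a float equal to an int shares its hash
minus_one_hash = hash(-1)       # -1 is reserved as an error value in C, so CPython returns -2
wrapped_hash = hash(P + 5)      # larger ints are reduced modulo P

# Integer keys that differ by a multiple of P all land on the same hash value.
keys = [5 + k * P for k in range(4)]
distinct_hashes = {hash(k) for k in keys}

print(small_int_hash, float_hash, minus_one_hash, wrapped_hash, distinct_hashes)
```

Whether such a pattern matters depends on the key distribution of the application; the sketch only shows that the pattern exists.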

It is pretty much the same use case as Python's dictionary, though; the
goal is just to generalize it for use with a distributed kv store.
Another big advantage is that it is more user-friendly to apply *hash*
directly to a type.


  > > I'd recommend hashlib:

+1
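
As a rough sketch of the hashlib route for the reproducible, seeded hashing this thread asks about: the helper below is hypothetical (stable_key is not a standard-library function), and it assumes the caller has already serialized the object to bytes in a stable way. It uses the keyed mode of hashlib.blake2b so that the seed changes the result while the output stays identical across processes and machines.

```python
import hashlib


def stable_key(data: bytes, seed: bytes = b"", size: int = 8) -> int:
    """Hypothetical helper: a seed-dependent integer digest of `data`.

    Unlike the built-in hash(), the result does not depend on
    PYTHONHASHSEED or on the interpreter process.
    """
    # blake2b accepts a key of up to 64 bytes and a digest_size of 1..64 bytes.
    h = hashlib.blake2b(data, digest_size=size, key=seed)
    return int.from_bytes(h.digest(), "big")


key_a = stable_key(b"user/42", seed=b"seed-1")
key_b = stable_key(b"user/42", seed=b"seed-1")
key_c = stable_key(b"user/42", seed=b"seed-2")
print(key_a == key_b, key_a == key_c)
```

The trade-off is that the caller must pick the byte encoding explicitly, which is less convenient than calling hash() on an arbitrary object.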

  > Otherwise, would that be useful to add siphash24 or fnv in the
  > hashlib as well?

I think that is a good idea.  To me, it seems relatively likely to be
accepted quickly.  However, many cryptographic algorithms are delicate
(eg, to avoid timing attacks), so I could be wrong about that.  Folks
like Christian Heimes might be very concerned about the implementation
as well as the algorithm.

Note that Python/pyhash.c seems to have implementations of both of
these algorithms, although I don't know if these implementations
satisfy cryptographic needs.

According to the docs, there seem to be two categories of hash
function: one for general cryptographic purposes, and another for
message authentication codes.

The algorithms mentioned above could mostly be put into the second
category.
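
To illustrate the two categories with what the standard library already offers, here is a minimal sketch using hashlib.sha256 for an unkeyed digest and hmac for a keyed message authentication code. The message and secrets are made-up placeholders, and neither siphash24 nor fnv is used here, since hashlib does not expose them.

```python
import hashlib
import hmac

message = b"cache-key:user/42"   # placeholder payload
secret = b"shared-secret"        # placeholder key, for illustration only

# Category 1: an unkeyed cryptographic digest; anyone can recompute it.
digest = hashlib.sha256(message).hexdigest()

# Category 2: a message authentication code; the tag depends on a secret key.
tag = hmac.new(secret, message, hashlib.sha256).hexdigest()
other_tag = hmac.new(b"another-secret", message, hashlib.sha256).hexdigest()

# Tags should be compared in constant time.
same = hmac.compare_digest(tag, hmac.new(secret, message, hashlib.sha256).hexdigest())
print(digest != tag, tag != other_tag, same)
```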

Steve

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/FBYX6MPTZGQUPQICYGYOPMLGAELUVF2H/
Code of Conduct: http://python.org/psf/codeofconduct/