Suppose you have implemented an immutable Position type to represent the state of a game played on an MxN board, where the board size can grow quite large.
Or suppose you have implemented an immutable, ordered collection type. For example, the collections-extended package provides a frozensetlist [1], and one of my own packages provides a frozen, ordered bidirectional mapping type [2]. These types should be hashable so that they can be inserted into sets and mappings. Because their contents are order-sensitive, they cannot use the built-in collections.Set._hash() helper in their __hash__ implementations without causing unnecessary hash collisions between objects that compare unequal only because they contain the same items in a different order.

According to https://docs.python.org/3/reference/datamodel.html#object.__hash__ :

    """
    it is advised to mix together the hash values of the components of the
    object that also play a part in comparison of objects by packing them
    into a tuple and hashing the tuple. Example:

        def __hash__(self):
            return hash((self.name, self.nick, self.color))
    """

Applying this advice to the use cases above would require creating an arbitrarily large tuple in memory before passing it to hash(), only to throw it away immediately afterward. It would be preferable if there were a way to pass multiple values to hash() in a streaming fashion, such that the overall hash were computed incrementally, without first building up a large object in memory.

Should there be better support for this use case? Perhaps hash() could support an alternative signature, allowing it to accept a stream of values whose combined hash would be computed incrementally in *constant* space and linear time, e.g. "hash(items=iter(self))".

In the meantime, what is the best way to incrementally compute a good hash value for such objects using built-in Python routines? (As a library author, it would be preferable to use a routine with explicit support for computing a hash incrementally, rather than having to worry about how to correctly combine the results of multiple calls to hash(contained_item) in library code. Simply XORing such results together, for instance, is not order-sensitive and so wouldn't work; see the quick illustration below. Using a routine with explicit support for incremental hashing would allow libraries to focus on doing one thing well. [3, 4, 5])

I know that hashlib provides algorithms that support incremental hashing, but those produce digests of at least 128 bits. Since hash() throws out anything beyond sys.hash_info.hash_bits (e.g. 64) bits, anything in hashlib seems like overkill. Am I right in thinking that's the wrong tool for the job? On the other hand, would binascii.crc32 be suitable, at least for 32-bit systems? (And is there some 64-bit incremental hash algorithm available for 64-bit systems? It seems Python has no built-in support for crc64.) For example:

    import binascii
    import struct

    class FrozenOrderedCollection:

        def __hash__(self):
            # (i) Compute the hash lazily on first use, then cache it.
            if hasattr(self, '_hashval'):
                return self._hashval
            # (ii) Prime the hash with a class-specific initial value.
            hv = binascii.crc32(b'FrozenOrderedCollection')
            for i in self:
                hv = binascii.crc32(struct.pack('@l', hash(i)), hv)
                hv &= 0xffffffff
            self._hashval = hv
            return hv

Note that this example illustrates two other common requirements of these use cases: (i) lazily computing the hash value on first use, and then caching it for future use, and (ii) priming the overall hash value with a class-specific initial value, so that if an instance of a different type of collection, comprising the same items but comparing unequal, were to compute its hash value out of the same constituent items, our hash value would still differ.
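(Returning to the XOR point above: here is a throwaway illustration, purely for exposition, of why XOR-combining the item hashes is not order-sensitive. Any order-sensitive combiner has to mix in each item's position as well as its hash.)

    # XOR is commutative and associative, so any ordering of the same
    # item hashes combines to the same value:
    h1, h2, h3 = hash('foo'), hash('bar'), hash('baz')
    assert h1 ^ h2 ^ h3 == h3 ^ h1 ^ h2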
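(And for comparison, here is a minimal sketch of what the hashlib-based alternative mentioned above might look like. This is only to make the comparison concrete; the choice of sha1, the truncation of the digest to sys.hash_info.hash_bits bits, and the illustrative class name are my own assumptions, not an established recipe.)

    import hashlib
    import struct
    import sys

    class FrozenOrderedCollectionViaSHA1:

        def __hash__(self):
            # Same lazy-compute-and-cache and class-specific-priming
            # pattern as in the crc32 sketch above.
            if hasattr(self, '_hashval'):
                return self._hashval
            h = hashlib.sha1(b'FrozenOrderedCollectionViaSHA1')
            for i in self:
                # Feed each item's hash in order, packed as a signed
                # 8-byte value (big enough for any hash() result).
                h.update(struct.pack('=q', hash(i)))
            # Keep only hash_info.hash_bits bits of the 160-bit digest,
            # since hash() would throw out anything beyond that anyway.
            mask = (1 << sys.hash_info.hash_bits) - 1
            self._hashval = int.from_bytes(h.digest(), 'little') & mask
            return self._hashval

This is order-sensitive and incremental, but computing a 160-bit cryptographic digest only to discard most of it is exactly what makes hashlib feel like the wrong tool for the job here.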
(On that note, should the documentation at https://docs.python.org/3/reference/datamodel.html#object.__hash__ quoted above be updated to add this advice? The current advice to "return hash((self.name, self.nick, self.color))" would cause a hash collision with a tuple of the same values, even though the tuple should presumably compare unequal with this object.)

To summarize these questions:

1. Should hash() add support for incremental hashing?

2. In the meantime, what is the best way to compute a hash of a combination of many values incrementally (in constant space and linear time), using only what's available in the standard library? Ideally there is some routine available that uses exactly sys.hash_info.hash_bits bits and that combines the incremental results for you.

3. Should the https://docs.python.org/3/reference/datamodel.html#object.__hash__ documentation be updated to include suitable advice for these use cases, in particular that the overall hash value should be computed lazily and incrementally, and should be primed with a class-unique value?

Thanks in advance for a helpful discussion, and best wishes.

Josh

References:

[1] http://collections-extended.lenzm.net/api.html#collections_extended.frozensetlist
[2] https://bidict.readthedocs.io/en/dev/api.html#bidict.frozenorderedbidict
[3] http://stackoverflow.com/questions/2909106/python-whats-a-correct-and-good-way-to-implement-hash#comment28193015_19073010
[4] http://stackoverflow.com/a/2909572/161642
[5] http://stackoverflow.com/a/27952689/161642