On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder <n...@nedbatchelder.com> wrote:
> You can write a simple function to use hash iteratively to hash the
> entire stream in constant space and linear time:
>
>     def hash_stream(them):
>         val = 0
>         for it in them:
>             val = hash((val, it))
>         return val
>
> Although this creates N 2-tuples, they come and go, so the memory use
> won't grow. Adjust the code as needed to achieve canonicalization
> before iterating.
>
> Or maybe I am misunderstanding the requirements?

This is better than solutions like
http://stackoverflow.com/a/27952689/161642 in the sense that it's a
little higher level (no bit shifting or magic numbers). But it's not
clear that it's any better in the sense that you're still rolling your
own incremental hash algorithm out of a lower-level primitive that
doesn't document support for this, and therefore taking responsibility
yourself for how well it distributes values into buckets.

Are you confident this results in good hash performance? Is this better
than a solution built on top of a hash function with an explicit API for
calculating a hash incrementally, such as the crc32 example I included?
(And again, this would ideally be a sys.hash_info.hash_bits-bit
algorithm.)

Don't we still probably want either:

1) Python to provide some such hash_stream() function as a built-in, or
failing that,

2) the https://docs.python.org/3/reference/datamodel.html#object.__hash__
documentation to bless this as the recommended solution to this problem,
thereby providing assurance of its performance?

If that makes sense, I'd be happy to file an issue and include the start
of a patch providing either 1 or 2.

Thanks very much for the helpful response.
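For reference, the "explicit API for calculating a hash incrementally" mentioned above can be sketched with zlib.crc32's running-checksum parameter. The original crc32 example from earlier in the thread isn't quoted here, so this is an assumed reconstruction; in particular, serializing each item with repr() is only an illustrative choice, and the result is a 32-bit checksum rather than a sys.hash_info.hash_bits-bit hash:

```python
import zlib

def crc32_stream(them):
    # Fold each item into a running CRC-32 in constant space and
    # linear time, using zlib.crc32's documented second argument
    # (the starting checksum value).
    val = 0
    for it in them:
        # repr() is a stand-in serialization for illustration only;
        # real code would need a canonical byte encoding per item.
        val = zlib.crc32(repr(it).encode(), val)
    return val
```

Because crc32 is a running checksum over concatenated input, hashing items one at a time this way matches hashing their concatenated serializations in one call, which is exactly the documented-incremental property that hash((val, it)) lacks.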
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/