It was brought to my attention that there is an open ticket for this. Moving discussion there:
https://bugs.python.org/issue45150

Aur

On Sun, Mar 13, 2022 at 6:19 PM Aur Saraf <sonofli...@gmail.com> wrote:

> Hi,
>
> TL;DR: you can read just the code snippets and the last paragraph.
>
> First of all, I'm assuming hashlib is used to calculate hashes of large
> files in production very often, such that even small performance and
> usability improvements would make a huge difference. If you don't share
> this assumption, delete this email :-)
>
> Today, hashlib.update() accepts only bytes-like objects, and many
> StackOverflow answers include code like:
>
>     sha256 = hashlib.sha256()
>     with open(path, 'rb') as f:
>         while True:
>             data = f.read(BUF_SIZE)
>             if not data:
>                 break
>             sha256.update(data)
>     return sha256.hexdigest()
>
> This is problematic because not everybody knows this pattern, so many
> codebases will include code like:
>
>     with open(path, 'rb') as f:
>         return sha256(f.read()).hexdigest()
>
> and, frankly, who can blame them.
>
> It is also problematic for performance reasons: even if we know to do the
> chunking thing, while hashing a large file the GIL will be taken and
> released many times, and many buffers will be allocated and deallocated.
>
> As far as I can see, hashlib already has a lock per hash object and safely
> releases the GIL in update() with a long bytes(), so it would be safe to
> add an option for update()/new() to take a file pointer and do the chunked
> reading/updating with one static buffer with the GIL released throughout,
> so that this would be
>
>     with open(path, 'rb') as f:
>         return sha256(f).hexdigest()
>
> We can discuss whether this is the best API or it's preferable to have
>
>     with open(path, 'rb') as f:
>         return sha256.from_file(f).hexdigest()
>
> or
>
>     with open(path, 'rb') as f:
>         return sha256().update_from_file(f).hexdigest()
>
> but I submit that today many people try sha256(f).hexdigest() because
> they're used to e.g. json and csv accepting file objects, and that today
> passing a file object raises, so making both new() and update() accept file
> objects would be the most beginner-friendly and won't break anything.
>
> Knowing that there are a million tiny details that need to be... hashed
> out, and given that I'm willing to write the code, would the devs be
> receptive to something like this?
>
> Thanks,
> Aur
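
For reference, the chunked pattern from the quoted message can be sketched in today's Python with a single reusable buffer, which avoids allocating a new bytes object per chunk. This is only an illustration, not the API proposed above: the helper name hash_file, the buf_size parameter, and the 1 MiB default are all assumptions made for the example.

```python
import hashlib

BUF_SIZE = 1024 * 1024  # hypothetical chunk size (1 MiB); tune for your workload


def hash_file(path, algorithm="sha256", buf_size=BUF_SIZE):
    """Return the hex digest of the file at `path`, reading it in chunks.

    Hypothetical helper for illustration; not part of hashlib.
    """
    digest = hashlib.new(algorithm)
    buf = bytearray(buf_size)   # one buffer, reused for every read
    view = memoryview(buf)      # slicing a memoryview does not copy
    # Binary mode is required: hash objects accept bytes-like data, not str.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(view)    # number of bytes read; 0 at end of file
            if not n:
                break
            digest.update(view[:n])
    return digest.hexdigest()
```

Calling hash_file(path) should return the same hex string as hashlib.sha256(open(path, 'rb').read()).hexdigest(), without holding the whole file in memory. On Python 3.11 and later, hashlib.file_digest(f, "sha256") takes a binary file object directly and is likely the simpler choice where available.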
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/WZF5POSOP4A6N4YRY4KBHPCTIZBEVE2Z/