Ola Natvig <[EMAIL PROTECTED]> wrote:
> Hi all
>
> Does anyone know of a fast way to calculate checksums for a large file?
> I need a way to generate ETag keys for a webserver; ETags for large
> files are not really necessary, but it would be nice if I could do it.
> I'm using the Python hash function on dynamically generated strings
> (like page content), but on things like images I use shutil's
> copyfileobj function, and the hash of a file object is just its
> handler's memory address.
>
> Does anyone know of a Python utility I could use, perhaps something
> like the md5sum utility on *nix systems?
Here is an implementation of md5sum in Python. It's the same speed, give
or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by the CPU usage of the MD5 routine (in C in both cases)
and/or IO (also in C).

I discarded the first run so both tests ran with large_file in the cache.

$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054  large_file

real    0m1.046s
user    0m0.946s
sys     0m0.071s

$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054  large_file

real    0m1.033s
user    0m0.926s
sys     0m0.108s

$ ls -l large_file
-rw-r--r--  1 ncw ncw 115933184 Jul  8  2004 large_file

"""Re-implementation of md5sum in Python"""

import sys
import md5

def md5file(filename):
    """Return the hex digest of a file without loading it all into memory"""
    # Open in binary mode so the digest is byte-accurate on all platforms
    fh = open(filename, "rb")
    digest = md5.new()
    while 1:
        buf = fh.read(4096)
        if buf == "":
            break
        digest.update(buf)
    fh.close()
    return digest.hexdigest()

def md5sum(files):
    for filename in files:
        try:
            print "%s  %s" % (md5file(filename), filename)
        except IOError, e:
            print >> sys.stderr, "Error on %s: %s" % (filename, e)

if __name__ == "__main__":
    md5sum(sys.argv[1:])

--
Nick Craig-Wood <[EMAIL PROTECTED]> -- http://www.craig-wood.com/nick
--
http://mail.python.org/mailman/listinfo/python-list