Simon Forman wrote:
> Paul Rubin wrote:
>> "EP" <[EMAIL PROTECTED]> writes:
>>> Given that I am looking for matches of all files against all other
>>> files (of similar length) is there a better bet than using re.search?
>>> The initial application concerns files in the 1,000's, and I could use
>>> a good solution for a number of files in the 100,000's.
>> If these are text files, typically you'd use the Unix 'diff' utility
>> to locate the differences.
>
> If you can, you definitely want to use diff.  Otherwise, the difflib
> standard library module may be of use to you.  Also, since you're
> talking about comparing many files to each other, you could pull out a
> substring of one file and use the 'in' "operator" to check if that
> substring is in another file.  Something like this:
>
> f = open(filename) # or if binary open(filename, 'rb')
> f.seek(somewhere_in_the_file)
> substr = f.read(some_amount_of_data)
> f.close()
>
> try_diffing_us = []
> for fn in list_of_filenames:
>     data = open(fn).read() # or again open(fn, 'rb')...
>     if substr in data:
>         try_diffing_us.append(fn)
>
> # then diff just those filenames...
>
> That's a naive implementation but it should illustrate how to cut down
> on the number of actual diffs you'll need to perform.  Of course, if
> your files are large it may not be feasible to do this with all of
> them.  But they'd have to be really large, or there'd have to be lots
> and lots of them...  :-)
>
> More information on your actual use case would be helpful in narrowing
> down the best options.
>
> Peace,
> ~Simon
>
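
For the difflib route mentioned above, here is a minimal sketch of what
a pairwise comparison might look like. The helper names similarity and
changed_lines are made up for illustration and are not part of difflib;
only SequenceMatcher and unified_diff are. SequenceMatcher can be slow on
large inputs, so this is probably only practical on a shortlist of
candidate pairs, not on every pair of files.

```python
import difflib


def similarity(text_a, text_b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()


def changed_lines(text_a, text_b):
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(), lineterm=""
    )
    # The first two lines are the '---' / '+++' file headers; skip them.
    body = list(diff)[2:]
    return [line for line in body if line.startswith(("+", "-"))]
```

A pair with a ratio near 1.0 is a near match and worth a closer look;
changed_lines then shows what actually differs between the two.
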
Would it be more efficient to checksum the files and then only diff the
ones that fail a checksum compare?  The functions below may be of some
help.

#!/usr/bin/python
#
# Function: generate and compare checksums on a file

import hashlib  # hashlib supersedes the old md5 module


def getsum(filename):
    """
    Generate the checksum from chunks of the file as they are read.
    """
    md5sum = hashlib.md5()
    # Binary mode, so the digest reflects the exact bytes on disk
    f = open(filename, 'rb')
    try:
        for block in getblocks(f):
            md5sum.update(block)
    finally:
        f.close()
    return md5sum.hexdigest()


def getblocks(f, blocksize=1024):
    """
    Read the file in small chunks to avoid loading large files into memory.
    """
    while True:
        s = f.read(blocksize)
        if not s:
            break
        yield s


def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values.
    The caller and filename arguments are accepted but not used.
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful

--
Adversity: That which does not kill me only postpones the inevitable.
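
P.S. A rough sketch of how the checksum idea could be applied to a whole
list of files, so that each file is read at most once rather than once
per pair. The names file_md5 and group_identical are hypothetical, and
the size pre-filter is my own assumption, not something from the thread.
Note that a checksum only finds byte-for-byte identical files; anything
that merely looks similar would still need diff or difflib.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, blocksize=65536):
    """Hash a file in fixed-size blocks so it never sits fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


def group_identical(paths):
    """Return lists of paths whose size and MD5 digest both match."""
    # Cheap first pass: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size has no possible twin, skip hashing
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[file_md5(path)].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

Every file that does not end up in one of the returned groups is unique
at the byte level, so those are the ones left over for diffing.
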