Simon Forman wrote:
> Andrew Robert wrote:
>> Simon Forman wrote:
>>> Paul Rubin wrote:
>>>> "EP" <[EMAIL PROTECTED]> writes:
>>>>> Given that I am looking for matches of all files against all other
>>>>> files (of similar length) is there a better bet than using re.search?
>>>>> The initial application concerns files in the 1,000's, and I could use
>>>>> a good solution for a number of files in the 100,000's.
>>>> If these are text files, typically you'd use the Unix 'diff' utility
>>>> to locate the differences.
>>> If you can, you definitely want to use diff. Otherwise, the difflib
>>> standard library module may be of use to you. Also, since you're
>>> talking about comparing many files to each other, you could pull out a
>>> substring of one file and use the 'in' "operator" to check if that
>>> substring is in another file. Something like this:
>>>
>>> f = open(filename) # or if binary open(filename, 'rb')
>>> f.seek(somewhere_in_the_file)
>>> substr = f.read(some_amount_of_data)
>>> f.close()
>>>
>>> try_diffing_us = []
>>> for fn in list_of_filenames:
>>>     data = open(fn).read() # or again open(fn, 'rb')...
>>>     if substr in data:
>>>         try_diffing_us.append(fn)
>>>
>>> # then diff just those filenames...
>>>
>>> That's a naive implementation but it should illustrate how to cut down
>>> on the number of actual diffs you'll need to perform. Of course, if
>>> your files are large it may not be feasible to do this with all of
>>> them. But they'd have to be really large, or there'd have to be lots
>>> and lots of them... :-)
>>>
>>> More information on your actual use case would be helpful in narrowing
>>> down the best options.
>>>
>>> Peace,
>>> ~Simon
>>>
>> Would it be more efficient to checksum the files and then only diff the ones
>> that fail a checksum compare?
>>
>
> The thing about a checksum algorithm is that if you call it with some
> data, then change even one byte of the data and call it again, the
> resulting checksums will be (should be) very different.
>
> Checksumming the entire file contents won't help, but as bearophile
> said: "If you can determine of a given a file where its heading garbage
> stops, then you can compute the signature just computing the python
> hash of some of the following bytes (30 or
> 200 byte may suffice)."
>
> If the OP can do that then yes, it would likely be more efficient
> (although computing and comparing checksums on just 200 bytes or less
> might not be a significant gain on simply comparing the strings
> themselves.) But if he can't then the next best thing would be to take
> a subsection of the file somewhere after the heading cruft and search
> (using string 'in' string form) for that subsection in other files.
> (Actually there may be a better option than that, but I'm just not
> bright enough to have thought of it...)
>
>> Utilizing the functions below may be of some help.
>>
>>
>> #!/usr/bin/python
>> #
>> #
>> # Function: generate and compare checksums on a file
>>
>>
>> import md5, sys
>>
>>
>> def getsum(filename):
>>     """
>>     Generate the check sum based on received chunks of the file
>>     """
>>     md5sum = md5.new()
>>     f = open(filename, 'r')
>>     for line in getblocks(f) :
>>         md5sum.update(line)
>>     f.close()
>>     return md5sum.hexdigest()
>>
>> def getblocks(f, blocksize=1024):
>>     """
>>     Read file in small chunks to avoid having large files loaded into
>>     memory
>>     """
>>     while True:
>>         s = f.read(blocksize)
>>         if not s: break
>>         yield s
>>
>> def checksum_compare(caller, cs='',check='', filename=''):
>>     """
>>     Compare the generated and received checksum valued
>>     """
>>     if cs != check:
>>         return 1 # compare failed
>>     else:
>>         return 0 # compare successful
>>
>
> I'm curious why you included this elaborate function with it's unused
> args (caller and filename), unnecessary defaults, and it's odd
> inversion of Boolean values..
>
> How is "if not checksum_compare(something, sum0, sum1): #do
> something..." any better than "if sum0 == sum1: #do something..." ?
>
> Peace,
> ~Simon
>
Because I was lazy..
The checksum_compare function came from something else I wrote that had
special logging and e-mailer calls in it. I should have ripped the
references to caller and filename out..
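
For anyone following along, here is a pared-down sketch of the checksum idea from the functions quoted above, with the comparison wrapper dropped in favour of a plain ==. It assumes a current Python 3, where hashlib has replaced the old md5 module, and it opens files in binary mode so the bytes are hashed unchanged on every platform. group_by_checksum is a hypothetical helper, not part of the original code; it only buckets filenames by digest so that files sharing a digest can be spotted quickly. As noted earlier in the thread, hashing whole files only finds byte-for-byte duplicates, so it will not catch files that differ only in their heading garbage.

```python
import hashlib
from collections import defaultdict


def getblocks(f, blocksize=1024):
    """Yield successive blocks so large files are never fully in memory."""
    while True:
        block = f.read(blocksize)
        if not block:
            break
        yield block


def getsum(filename):
    """Return the hex MD5 digest of a file, read in binary mode."""
    md5sum = hashlib.md5()
    with open(filename, 'rb') as f:
        for block in getblocks(f):
            md5sum.update(block)
    return md5sum.hexdigest()


def group_by_checksum(filenames):
    """Map each digest to the files that share it (hypothetical helper)."""
    groups = defaultdict(list)
    for fn in filenames:
        groups[getsum(fn)].append(fn)
    return groups
```

With that, the comparison is just "if getsum(a) == getsum(b): ...", and any bucket in group_by_checksum(list_of_filenames) holding more than one name is a set of identical files.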