Er.. obviousl, when i wrote : "scan the similarity algorithm and find all the diagonals", I meant scan the "similarity matrix and find all the diagonals".
"Similarity matrix" should really be called "Equality matrix", as I imagine it as a matrix with dimensions len(a) x len(b) where M[x][y] = (a[x] == b[y]) On 9/12/07, Kurdt Bane <[EMAIL PROTECTED]> wrote: > > Hi to all! > For reverse engineering purposes, I need to find where every possible > chunk of bytes in file A is contained in file B. Obviously, if a chunk of > length n is contained in B, I dont' want my script to recognize also all the > subchunks of size < n contained in the chunk. > > I coded a naive implementation of a similarity algorithm: scan the > similarity algorithm and find all the diagonals. Here's the code: > > file_a = open(argv[1]) > file_b = open(argv[2]) > > a = numpy.fromstring (file_a.read(),'c') > b = numpy.fromstring(file_b.read(),'c') > > tolerance = int(argv[3]) > > chunks = [] > valid = True > > count_xcent = 0 > for x in xrange(len(a)): > for y in xrange(len(b)): > count = 0 > if (a[x] == b[y]): > x_cnt, y_cnt = x,y > if (a[x_cnt] == b[y_cnt]): > try: > while (a[x_cnt+1] == b[y_cnt+1]): > count += 1 > x_cnt += 1 > y_cnt += 1 > except IndexError: > pass > if ((count > tolerance) or (count == tolerance)): > for tuple in chunks: > if (((x >= tuple[0]) and (x_cnt <= tuple[1])) and > ((y >= tuple[2]) and (y_cnt <= tuple[3]))): > valid = False > if __debug__: > print "Found an already processed > subchunk" > break > if (valid): > chunks.append ((x, x_cnt, y, y_cnt)) > print "Corresponding chunk found. List 1: from " + > str(x) + " to " + str(x_cnt) +". List 2: from " + str(y) + " to " + > str(y_cnt) > print "with length of " + str (x_cnt + 1 - x) > else: > valid = True > > > It simply scans the similarity matrix, finds the diagonals in wich a[x] == > a[y] and interprets the diagonal as a chunk. Then, it stores the chunk in a > list and determines if it's a subchunk of another greater chunk already > found. > > The problem is: this implementation is very slow, imho because of three > factors: > > 1. I use a nested for loop to scan the matrix > > 2. When the start of a diagonal is found, the program scans the diagonal > with another additional loop. Maybe it would be faster to use a function > such as diagonal() (but I can't actually _create_ the boolean similarity > matrix, as it gets way too big for files big enough (we're talking about ~ > 100 Kb files) - I am forced to compute it "on the way". > > 3. When I find a chunk, I compute all its subchunks with this approach, > and compare them with the "big" chunk I stored in the list. Maybe there's a > better algorithmical way. > > What do you think about these issues? Is there a way to optimize them? And > are there other issues I didn't take in account? > > Thanks in advance, > > regards, > > Chris. >
_______________________________________________ Numpy-discussion mailing list [email protected] http://projects.scipy.org/mailman/listinfo/numpy-discussion
