I have an idea for resynchronizing matches between files and I'm curious what
you think of it. I'm not sure about the quality of this idea, nor about how
feasible it would be to implement.
The basic algorithm is this:
1). Take a limited number of random (sub)strings from file A (for example 20,
one every 5%).
2). Make 20 linked lists of the samples' matches in file B. (If the samples
are "unique" enough, the lists stay short.)
3). Expand all unique matches (linked lists with 2 items: the sample and its
single match in B) maximally in both directions. (Is this the "snake" in
action?)
4). Mark all the matched text found and the match links between the files.
5). Repeat from step 1 once (twice? recursively?), but not over the whole
file, only over the parts between two previous matches. (See the sketch after
this list.)
6). We now have a maximum of 20x20 = 400 matches between the files (8000 for
3 passes...).
7). All text not yet marked is now assumed to exist only in file A or only in
file B.
8). Maybe use a different algorithm for further refinement.
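
To get a feel for the feasibility question, here is a rough Python sketch of
steps 1-5. All the details the list above leaves open are my own guesses: the
sample length, the fixed recursion depth, and the names (resync, find_all,
expand) are made up for illustration, and none of this comes from Meld's
actual matching code.

import random

SAMPLE_COUNT = 20   # step 1: the "20" from above
SAMPLE_LEN = 24     # assumed sample length; the idea above doesn't fix one
MAX_DEPTH = 2       # step 5: repeat once ("twice, recursive?" left open)

def find_all(haystack, needle, lo, hi):
    """Step 2: list every offset of needle in haystack[lo:hi]."""
    hits, i = [], haystack.find(needle, lo, hi)
    while i != -1:
        hits.append(i)
        i = haystack.find(needle, i + 1, hi)
    return hits

def expand(a, b, ai, bi, n, a_lo, a_hi, b_lo, b_hi):
    """Step 3: grow a unique match maximally in both directions (the "snake")."""
    while ai > a_lo and bi > b_lo and a[ai - 1] == b[bi - 1]:
        ai, bi, n = ai - 1, bi - 1, n + 1
    while ai + n < a_hi and bi + n < b_hi and a[ai + n] == b[bi + n]:
        n += 1
    return ai, bi, n

def resync(a, b, a_lo=0, a_hi=None, b_lo=0, b_hi=None, depth=0, rng=None):
    """Steps 1-5: return (a_pos, b_pos, length) anchors, recursing into gaps."""
    a_hi = len(a) if a_hi is None else a_hi
    b_hi = len(b) if b_hi is None else b_hi
    rng = rng or random.Random(0)
    if depth >= MAX_DEPTH or a_hi - a_lo < SAMPLE_LEN:
        return []
    anchors = []
    for _ in range(SAMPLE_COUNT):
        ai = rng.randrange(a_lo, a_hi - SAMPLE_LEN + 1)
        hits = find_all(b, a[ai:ai + SAMPLE_LEN], b_lo, b_hi)
        if len(hits) == 1:   # "unique": the sample plus exactly one match
            anchors.append(expand(a, b, ai, hits[0], SAMPLE_LEN,
                                  a_lo, a_hi, b_lo, b_hi))
    # Step 4: keep a non-crossing subset so the gaps are well defined.
    # (Crossing matches, i.e. the reordered-functions case, would need
    # extra bookkeeping and are simply dropped in this sketch.)
    anchors.sort()
    kept, a_end, b_end = [], a_lo, b_lo
    for ai, bi, n in anchors:
        if ai >= a_end and bi >= b_end:
            kept.append((ai, bi, n))
            a_end, b_end = ai + n, bi + n
    # Step 5: repeat the sampling inside each gap between two matches.
    result, pa, pb = [], a_lo, b_lo
    for ai, bi, n in kept:
        result += resync(a, b, pa, ai, pb, bi, depth + 1, rng)
        result.append((ai, bi, n))
        pa, pb = ai + n, bi + n
    result += resync(a, b, pa, a_hi, pb, b_hi, depth + 1, rng)
    return result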
Some thoughts:
- This works beautifully if, for example, the order of complete functions is
changed in a source file.
- I think it's relatively fast because of the limited number of scans through
the files.
- Files which don't match at all can easily be recognized (if the samples in
step 1 are chosen properly).
- Make a binary tree of text snippets marked "matched", "only in A", or "only
in B"? (A sketch follows below.)
- How difficult would it be to implement this?
- Lots of room for all kinds of optimisations:
  - Don't re-read file A.
  - Smart "guesses" in file B.
  - Offsets from start or end of line for matches.
  - Etc.
- Maybe a variant of this idea can be integrated into the existing matching
algorithms.
- I'm curious what you think of this; maybe I'm just being silly.
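
Following up on step 7 and the binary-tree thought: a small sketch of how the
anchors from the resync() sketch above could be turned into snippets tagged
"matched", "only in A", or "only in B". Again, the names and the flat-list
representation are my own; a binary tree keyed on offset could then be built
over these regions for fast lookup.

def classify(a, b, anchors):
    """Step 7: everything outside an anchor exists only in A or only in B."""
    regions, pa, pb = [], 0, 0
    for ai, bi, n in anchors:          # anchors come back sorted from resync()
        if ai > pa:
            regions.append(("only in A", pa, ai))
        if bi > pb:
            regions.append(("only in B", pb, bi))
        regions.append(("matched", ai, bi, n))
        pa, pb = ai + n, bi + n
    if pa < len(a):
        regions.append(("only in A", pa, len(a)))
    if pb < len(b):
        regions.append(("only in B", pb, len(b)))
    return regions

# Hypothetical usage; old.c / new.c stand in for any two files.
a = open("old.c").read()
b = open("new.c").read()
for region in classify(a, b, resync(a, b)):
    print(region)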
Greetings,
Pintuxgu.