I have an idea for resynchronizing matches between files and I'm curious what
you think of it. I'm not sure about the quality of this idea, nor about how
feasible it would be to implement.
The basic algorithm is this:
1). Take a limited number of random (sub)strings from file A (for example 20,
one every 5%).
2). Make 20 linked lists of the samples' matches in file B. (If the samples
are "unique" enough, the lists stay short.)
3). Expand all unique matches (linked lists with 2 items: the sample and its
single match in B) maximally in both directions. (Is this the "snake" in
action?)
4). Mark all the matched text found and the match links between the files.
5). Repeat from step 1 once (twice? recursively?), but not over the whole
file, only over the parts between two previous matches. (See the sketch after
this list.)
6). We now have a maximum of 20x20 = 400 matches between the files (8000 for
3 passes...).
7). All text not yet marked is now assumed to exist only in file A or only in
file B.
8). Maybe use a different algorithm for further refinement.
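
To get a feel for the feasibility question, here is a rough Python sketch of
steps 1-5. All the details the list above leaves open are my own guesses: the
sample length, the fixed recursion depth, and the names (resync, find_all,
expand) are made up for illustration, and none of this comes from Meld's
actual matching code.

import random

SAMPLE_COUNT = 20   # step 1: the "20" from above
SAMPLE_LEN = 24     # assumed sample length; the idea above doesn't fix one
MAX_DEPTH = 2       # step 5: repeat once ("twice, recursive?" left open)

def find_all(haystack, needle, lo, hi):
    """Step 2: list every offset of needle in haystack[lo:hi]."""
    hits, i = [], haystack.find(needle, lo, hi)
    while i != -1:
        hits.append(i)
        i = haystack.find(needle, i + 1, hi)
    return hits

def expand(a, b, ai, bi, n, a_lo, a_hi, b_lo, b_hi):
    """Step 3: grow a unique match maximally in both directions (the "snake")."""
    while ai > a_lo and bi > b_lo and a[ai - 1] == b[bi - 1]:
        ai, bi, n = ai - 1, bi - 1, n + 1
    while ai + n < a_hi and bi + n < b_hi and a[ai + n] == b[bi + n]:
        n += 1
    return ai, bi, n

def resync(a, b, a_lo=0, a_hi=None, b_lo=0, b_hi=None, depth=0, rng=None):
    """Steps 1-5: return (a_pos, b_pos, length) anchors, recursing into gaps."""
    a_hi = len(a) if a_hi is None else a_hi
    b_hi = len(b) if b_hi is None else b_hi
    rng = rng or random.Random(0)
    if depth >= MAX_DEPTH or a_hi - a_lo < SAMPLE_LEN:
        return []
    anchors = []
    for _ in range(SAMPLE_COUNT):
        ai = rng.randrange(a_lo, a_hi - SAMPLE_LEN + 1)
        hits = find_all(b, a[ai:ai + SAMPLE_LEN], b_lo, b_hi)
        if len(hits) == 1:   # "unique": the sample plus exactly one match
            anchors.append(expand(a, b, ai, hits[0], SAMPLE_LEN,
                                  a_lo, a_hi, b_lo, b_hi))
    # Step 4: keep a non-crossing subset so the gaps are well defined.
    # (Crossing matches, i.e. the reordered-functions case, would need
    # extra bookkeeping and are simply dropped in this sketch.)
    anchors.sort()
    kept, a_end, b_end = [], a_lo, b_lo
    for ai, bi, n in anchors:
        if ai >= a_end and bi >= b_end:
            kept.append((ai, bi, n))
            a_end, b_end = ai + n, bi + n
    # Step 5: repeat the sampling inside each gap between two matches.
    result, pa, pb = [], a_lo, b_lo
    for ai, bi, n in kept:
        result += resync(a, b, pa, ai, pb, bi, depth + 1, rng)
        result.append((ai, bi, n))
        pa, pb = ai + n, bi + n
    result += resync(a, b, pa, a_hi, pb, b_hi, depth + 1, rng)
    return result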
Some thoughts:
- This works beautifully if, for example, the order of complete functions is
changed in a source file.
- I think it's relatively fast because of the limited number of scans through
the files.
- Files which don't match at all can easily be recognized (if the samples in
step 1 are chosen properly).
- Make a binary tree of text snippets marked "matched", "only in A", or "only
in B"? (A sketch follows below.)
- How difficult would it be to implement this?
- Lots of room for all kinds of optimisations:
  - Don't re-read file A.
  - Smart "guesses" in file B.
  - Offsets from start or end of line for matches.
  - Etc.
- Maybe a variant of this idea can be integrated into the existing matching
algorithms.
- I'm curious what you think of this; maybe I'm just being silly.
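
Following up on step 7 and the binary-tree thought: a small sketch of how the
anchors from the resync() sketch above could be turned into snippets tagged
"matched", "only in A", or "only in B". Again, the names and the flat-list
representation are my own; a binary tree keyed on offset could then be built
over these regions for fast lookup.

def classify(a, b, anchors):
    """Step 7: everything outside an anchor exists only in A or only in B."""
    regions, pa, pb = [], 0, 0
    for ai, bi, n in anchors:          # anchors come back sorted from resync()
        if ai > pa:
            regions.append(("only in A", pa, ai))
        if bi > pb:
            regions.append(("only in B", pb, bi))
        regions.append(("matched", ai, bi, n))
        pa, pb = ai + n, bi + n
    if pa < len(a):
        regions.append(("only in A", pa, len(a)))
    if pb < len(b):
        regions.append(("only in B", pb, len(b)))
    return regions

# Hypothetical usage; old.c / new.c stand in for any two files.
a = open("old.c").read()
b = open("new.c").read()
for region in classify(a, b, resync(a, b)):
    print(region)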
Greetings,
Pintuxgu.