Hello On Sun, Jul 11, 2010 at 12:10 AM, Kai <[email protected]> wrote:
> > Ages ago when you first submitted the Myers diff implementation, you > said that it was slower than the difflib implementation; do you have > any idea how much slower? I'd be interested in finding out whether > switching to Myers wholesale would be a reasonable thing to do. > > Also, there is an implementation of Myers diff in Review Board (in > reviewboard/diffviewer/myersdiff.py). It may be worthwhile looking at > that code (and other related bits) to see whether there is anything we > can share/appropriate for use in Meld. > Ages ago it was indeed and things have changed :) I am attaching a small patch that boosts the performance of Myers matcher significantly. Once applied, MyersSequenceMatcher will become comparable with other existing sequence matchers. I performed a couple of tests to check the impact of this path and the results (time taken to complete the test) are attached below. I also included Myers diff from Review Board, but it looks like their implementation cannot offer anything when it comes to performance (the result of rewriting GNU diff in Python with no Python-specific optimizations). There is one post-processing trick (shift inserts/deletions of identical lines) that may be worth adding to meld though. However, more testing is definitely needed. I will be off for 2 weeks, but when I am back I will try to test a couple of scenarios related to inline changes highlighting as they are slightly different due to very low number of unique lines (each line = 1 character). Regards, Piotr Test: calculate diff 100 times in a loop (I used historical revisions of filediffer.py and meldapp.py) 1. Almost identical files (1452 lines in common, 4 different lines): IncrementalSequenceMatcher: 0:00:00.542536 PatienceSequenceMatcher: 0:00:01.019957 MyersSequenceMatcher: 0:00:00.034457 MyersDiffer (Review Board): 0:00:01.436764 2 .Common: 859 diferent: 82 (8.5%) IncrementalSequenceMatcher: 0:00:00.310261 PatienceSequenceMatcher: 0:00:00.597474 MyersSequenceMatcher: 0:00:00.280938 MyersDiffer (Review Board): 0:00:01.021952 3. Common: 1259 diferent: 252 (17%) IncrementalSequenceMatcher: 0:00:01.087525 PatienceSequenceMatcher: 0:00:00.920743 MyersSequenceMatcher: 0:00:00.536368 MyersDiffer (Review Board): 0:00:02.267483 4. Common: 472 diferent: 678 (59%) IncrementalSequenceMatcher: 0:00:00.780251 PatienceSequenceMatcher: 0:00:00.490673 MyersSequenceMatcher: 0:00:00.368780 MyersDiffer (Review Board): 0:00:01.844979 5. Completely different files (common: 125, different: 1668): IncrementalSequenceMatcher: 0:00:00.737094 PatienceSequenceMatcher: 0:00:00.405088 MyersSequenceMatcher: 0:00:00.636358 MyersDiffer (Review Board): 0:00:02.308354
From 0cb38b90d4c984bc6a4e21a1145375e11b94fef2 Mon Sep 17 00:00:00 2001 From: Piotr Piastucki <[email protected]> Date: Mon, 12 Jul 2010 00:05:37 +0200 Subject: [PATCH] Myers matcher performance improvement --- meld/matchers.py | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/meld/matchers.py b/meld/matchers.py index fb81c24..f9ed34d 100644 --- a/meld/matchers.py +++ b/meld/matchers.py @@ -101,17 +101,23 @@ class MyersSequenceMatcher(difflib.SequenceMatcher): # discard lines that do not match any line from the other file if n > 0 and m > 0: + aset = set() + bset = set() + for newline in b: + bset.add(newline) + for newline in a: + aset.add(newline) a2 = [] b2 = [] j = 0 for i, newline in enumerate(b): - if newline in a: + if newline in aset: b2.append(newline) bindex[j] = i j += 1 k = 0 for i, origline in enumerate(a): - if origline in b: + if origline in bset: a2.append(a[i]) aindex[k] = i k += 1 -- 1.7.0.4
_______________________________________________ meld-list mailing list [email protected] http://mail.gnome.org/mailman/listinfo/meld-list
