[issue2986] difflib.SequenceMatcher not matching long sequences

Vlastimil Brom Mon, 19 Apr 2010 16:25:16 -0700

Vlastimil Brom <vlastimil.b...@gmail.com> added the comment:

I just stumbled on some seemingly different unexpected behaviour of
difflib.SequenceMatcher, but it turns out it may have the same cause, i.e. the
"popular" heuristics.
I hopefully managed to replicate it on an illustrative sample text, which is
included in the attached file. (I also mentioned this issue on the python-list
http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as
I found no replies there, this might be a more appropriate place.)
Both strings differ in a minimal way, each having one extra character
in a "strategic" position, which probably meets some pathological case
for difflib.
Instead of just reporting the insertion and deletion of these single
characters (which works well in most cases, i.e. with most other
positions of the differing characters), SequenceMatcher reports the
deletion of a large part of the string between the differences and
the insertion of almost the same text after that.
The attached code simply prints the results of the comparison with the
respective tags, and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on Windows XP SP3.
I didn't find any plausible mentions of such cases in the documentation, but 
after some searching I found several reports in the bug tracker mentioning the 
erroneous output of SequenceMatcher on longer repetitive sequences.
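
For reference, the kind of output described above (tags plus the respective
substrings) can be produced along the following lines. This is only a minimal
sketch and not the attached difflib_test_inq.py; the function name
describe_opcodes and the short sample strings are made up for illustration.
The sample strings are far below 200 elements, so the "popular" heuristics
do not apply to them and the result is the expected one.

```python
import difflib


def describe_opcodes(a, b):
    """Return (tag, a_substring, b_substring) for every opcode.

    Hypothetical helper for illustration; no junk function is used.
    """
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Short illustrative strings: one character removed at the start,
# one character added at the end.
for tag, a_part, b_part in describe_opcodes("abcd", "bcde"):
    print("%-7s a=%r b=%r" % (tag, a_part, b_part))
```

On the problematic strings the same loop would show one large "delete" and
one large "insert" instead of two single-character changes.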


Besides this one
http://bugs.python.org/issue2986
see e.g.
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074

In my case, disabling the "popular" heuristics as mentioned by John Machin in
http://bugs.python.org/issue1528074#msg29269

seems to have solved the problem; with a modified version of difflib containing:

                if 0:   # disable popular heuristics
                    if n >= 200 and len(indices) * 100 > n:
                        populardict[elt] = 1
                        del indices[:]

the comparison catches the differences in the test strings as expected, i.e.
one character addition and one deletion only. It is likely that some other use
cases for difflib rely on the "popular" heuristics, but it also seems useful
to have some control over this behaviour, which might not be appropriate in all
cases.
(The issue seems to be the same in python 2.5, 2.6 and 3.1.)
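
As a note on the kind of control meant here: Python 3.2 and later (not the
versions tested above) accept an autojunk argument in the SequenceMatcher
constructor and expose the elements treated as "popular" in the bpopular
attribute. The following is a minimal sketch under that assumption; the helper
name popular_elements and the sample string are made up for illustration and
are not taken from the attached file.

```python
import difflib


def popular_elements(b, autojunk=True):
    """Return the elements of b that the "popular" heuristics would drop.

    Hypothetical helper; assumes Python 3.2+, where SequenceMatcher
    takes autojunk and provides the bpopular attribute.
    """
    matcher = difflib.SequenceMatcher(None, "", b, autojunk=autojunk)
    return set(matcher.bpopular)


text = "ab" * 100  # 200 elements, each character occurs 100 times

print(popular_elements(text))                  # heuristics active
print(popular_elements(text, autojunk=False))  # heuristics disabled
```

With the heuristics active, both characters of this 200-element string count
as popular; with autojunk=False none do, which corresponds to the "if 0:"
modification shown above.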

regards,
   vbr

----------
nosy: +vbr
Added file: http://bugs.python.org/file17001/difflib_test_inq.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2986>
_______________________________________
