Vlastimil Brom <vlastimil.b...@gmail.com> added the comment: I just stumbled on some seemingly different unexpected behaviour of difflib.SequenceMatcher, but it turns out, it may have the same cause, i.e. the "popular" heuristics. I hopefully managed to replicate it on an illustrative sample text - in as included in the attached file. (I also mentioned this issue in hte python-list http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as there were no replies I eventually found, this might be more appropriate place.) Both strings differ in a minimal way, each having one extra character in a "strategic" position, which probably meets some pathological case for difflib. Instead of just reporting the insertion and deletion of these single characters (which works well for most cases - with most other positions of the differing characters), the output of the SequenceMatcher decides to delete a large part of the string in between the differences and to insert the almost same text after that. The attached code simply prints the results of the comparison with the respective tags, and substrings. No junk function is used. I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on windows XPp SP3. I didn't find any plausible mentions of such cases in the documentation, but after some searching I found several reports in the bug tracker mentioning the erroneous output of SequenceMatcher on longer repetitive sequences.
besides this http://bugs.python.org/issue2986 e.g. http://bugs.python.org/issue1711800 http://bugs.python.org/issue4622 http://bugs.python.org/issue1528074 In my case, disabling the "popular" heuristics as mentioned by John Machin in http://bugs.python.org/issue1528074#msg29269 seems to have solved the problem; with a modified version of difflib containing: if 0: # disable popular heuristics if n >= 200 and len(indices) * 100 > n: populardict[elt] = 1 del indices[:] the comparison catches the differences in the test strings as expected - i.e. one character addition and deletion only. It is likely, that some other use cases for difflib may rely on the "popular"-heuristics but it also seems useful to have some control over this behaviour, which might not be appropriate in all cases. (The issue seems to be the same in python 2.5, 2.6 and 3.1.) regards, vbr ---------- nosy: +vbr Added file: http://bugs.python.org/file17001/difflib_test_inq.py _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue2986> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com