Patches item #1678339, was opened at 2007-03-11 11:11
Message generated for change (Comment added) made by jimjjewett
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1678339&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Tests
Group: Python 2.6
Status: Closed
Resolution: Accepted
Priority: 5
Private: No
Submitted By: Denys Rtveliashvili (rtvd)
Assigned to: Nobody/Anonymous (nobody)
Summary: Adding a testcase for the bug in find_longest_match

Initial Comment:
The find_longest_match method in the difflib's SequenceMatcher has a bug.

The bug is in turn caused by a problem with creating a b2j mapping which should 
contain a list of indices for each of the list elements in b. However, when the 
b2j mapping is being created (this is being done in __chain_b method in the 
SequenceMatcher) the mapping becomes broken. The cause of this is that for the 
frequently used elements the list of indices is removed and the element is 
being enlisted in the populardict mapping.

The test case tries to match two strings like:

abbbbbb.... and ...bbbbbbc

The number of b is equal and the find_longest_match should have returned the 
proper amount. However, in case the number of "b"s is large enough, the method 
reports that the length of the longest common substring is 0. It simply can't 
find it.

A bug was raised some time ago on this matter. It's ID is 1528074.

I tried to fix this bug but as a result, the performance drops by a factor of 
5-10. Perhaps someone more familiar with Python can find a good fix. For the 
time being I send this test case (which is broken until the bug is fixed) and 
I'm going to send a fix next (but the fix makes the method run quite slowly).

The patch diff attached was made against the trunk version of Python.

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2007-03-18 22:01

Message:
Logged In: YES 
user_id=764593
Originator: NO

I think the bug is documentation only.  

According to the comments (but not the docstring) find_longest_match
returns the longest _interesting_ match.  A single element appearing too
often is likely to cause spurious matches, and is therefore not
interesting.

I do agree that this should be noted more prominently, so people don't try
things like comparing text strings letter-by-letter (where 1% is far too
low a threshhold for a 26-character alphabet).

And yes, the comments on popular are correct -- it ignores elements which
constitute *more* than 1%.  

I recommend closing this set of tracker items. If you could submit changes
to the documentation (docstrings and/or help files; maybe even the
comments), I would recommend applying them.


----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2007-03-18 14:28

Message:
Logged In: YES 
user_id=849994
Originator: NO

Added the testcase to Lib/test/outstanding_bugs.py. It will be moved to
test_difflib as soon as the bug is fixed.

----------------------------------------------------------------------

Comment By: Paul Hankin (paulhankin)
Date: 2007-03-18 13:46

Message:
Logged In: YES 
user_id=1740099
Originator: NO

The test correctly fails, and looks right.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1678339&group_id=5470
_______________________________________________
Patches mailing list
Patches@python.org
http://mail.python.org/mailman/listinfo/patches

Reply via email to