[Patches] [ python-Patches-1678345 ] A fix for the bug #1528074 [warning: quite slow]

SourceForge.net Fri, 20 Apr 2007 23:36:05 -0700

Patches item #1678345, was opened at 2007-03-11 18:28
Message generated for change (Comment added) made by rtvd
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1678345&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Library (Lib)
Group: Python 2.6
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Denys Rtveliashvili (rtvd)
Assigned to: Nobody/Anonymous (nobody)
Summary: A fix for the bug #1528074 [warning: quite slow]

Initial Comment:
This is a fix for the bug #1528074 in the difflib's SequenceMatcher which makes 
(among other possible things) find_longest_match return invalid results 
sometimes.

The previously submitted test case for this bug has the #1678339 ID.

The find_longest_match and __chain_b methods in the SequenceMatcher are 
perfectly optimized. The find_longest_match would work both fast and correctly 
if only the __chain_b did not break it's assumptions.

The find_longest_match assumes that the b2j mapping has a mapping of all 
non-junk elements in b to lists of their indices in the "b" list. However, when 
__chain_b creates the b2j mapping, it removes popular elements from the list 
and marking the elements as popular in the "populardict". As a result, the 
find_longest_match method can't find the indices for the popular elements and 
they become automatically considered as something like a junk.

I tried to fix the bug by both changing the find_longest_match and __chain_b 
methods. No matter how hard I tried, the change dropped the performance and 
slowed down the matching by 5-10 times. The impact of find_longest_match method 
was larger, so I decided to send a patch for the __chain_b.

Please, note, that even though the method starts to work properly and the test 
cases pass on my computer just fine, the ingenious optimizations performed 
before become broken, so it would be great if a guru in Python code 
optimization tries to improve the things a bit.

One more point: if the indices are not removed, the memory consumption on the 
large strings can become quite great. If this is a serious concern, a fix in 
the find_longest_match will need to be done instead. However that fix would 
probably be far less efficient that this one.

----------------------------------------------------------------------

>Comment By: Denys Rtveliashvili (rtvd)
Date: 2007-04-21 10:34

Message:
Logged In: YES 
user_id=1416496
Originator: YES

I am sure this is NOT a bug in the documentation. There is no such thing
as "interesting" match unless it can be described mathematically. Instead,
the documentation 
specifies _precisely_ what exactly the function must do in a very precise
manner.

The problem is that it does a different thing.

Another issue of this algorithm is that it uses a quite slow algorithm.
It's computational complexity is O(n*m). Another one based on generalized
suffix trees has a much more modest complexity O(n+m). The details are
here:
http://en.wikipedia.org/wiki/Longest-common_substring_problem

----------------------------------------------------------------------

Comment By: Jason Orendorff (jorend)
Date: 2007-04-21 01:13

Message:
Logged In: YES 
user_id=18139
Originator: NO

Recommend closing this.

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2007-03-19 04:59

Message:
Logged In: YES 
user_id=764593
Originator: NO

I think the bug is documentation only.  

According to the comments (but not the docstring) find_longest_match
returns the longest _interesting_ match.  A single element appearing too
often is likely to cause spurious matches, and is therefore not
interesting.

I do agree that this should be noted more prominently, so people don't try
things like comparing text strings letter-by-letter (where 1% is far too
low a threshhold for a 26-character alphabet).

And yes, the comments on popular are correct -- it ignores elements which
constitute *more* than 1%.  

I recommend closing this set of tracker items. If you could submit changes
to the documentation (docstrings and/or help files; maybe even the
comments), I would recommend applying them.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1678345&group_id=5470
_______________________________________________
Patches mailing list
[email protected]
http://mail.python.org/mailman/listinfo/patches

[Patches] [ python-Patches-1678345 ] A fix for the bug #1528074 [warning: quite slow]

Reply via email to