I had considered the possibility of option A for 2.7 and A & C for 3.2.
But see below.
Since posting, I did an experiment with a 700 char paragraph of text
(the summary from the post) compared to an 'edited' version. I did the
comparison with and without the current heuristic. I did not notice
much time difference (a couple of seconds either way) and the edit list
was essentially the same. The heuristic lowered the reported match ratio
from .96 to .88, which would be bad when one wanted the unaltered value.
I do not know which, if any, chars other than 'e' were junked as that
info currently is not exposed. I propose below that it should be.
I intentionally did not list as options
D. Keep the status quo that is buggy for certain uses.
Good things often have more uses than the inventor intended or expected.
They should not be prevented.
E. Completely remove the heuristic, which would restore 'buggy'
performance for other uses.
One of the closed issues was E, rejected for that reason.
---
I also did not list one of my original ideas, but after the discussion
it is looking better to me. It is based on the following statement of
the current heuristic:
"Disregard as junk common items that occur in more that 1% of the
positions in second sequence b, as long as b is long enough so that
duplicates cannot be called common."
Tim, I do not know if you remember why you chose 200 as the cutoff, but
the above assumes that the following is not just a coincidence:
(2 > 199*.01) == True
(2 > 200*.01) == False
In other words, 200 is the smallest length for b that prevents the
junking of duplicates.
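That claim is easy to check mechanically; a quick sketch of the search
for the smallest such length:

```python
# A duplicate (an item occurring exactly twice) is junked by the
# current rule when its count exceeds 1% of the positions in b,
# i.e. when 2 > len(b) * .01.  Find the smallest len(b) for which
# that no longer happens:
smallest = next(n for n in range(1, 1000) if not (2 > n * .01))
print(smallest)  # 200
```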
F. Generalize the heuristic by replacing '1' with 'k', where k can be
None (no heuristic) or 1-99. If not None, replace 200 by 200/k to
minimally avoid junking of duplicates. If len(b) >= 200/k, then item
counts should be greater than (len(b)*k)//100, which need only be
calculated once.
Implementation: Add a new parameter named 'common' or 'threshold' or
whatever that defaults to 1. After computing b2j without the heuristic,
if 'common' is not None, prune b2j as described above.
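A rough sketch of that implementation idea, outside of SequenceMatcher
(the names build_b2j and prune_common are illustrative; the real b2j
mapping lives inside the class):

```python
def build_b2j(b):
    # Map each item of b to the list of indices where it occurs,
    # as SequenceMatcher does internally, with no heuristic applied.
    b2j = {}
    for i, item in enumerate(b):
        b2j.setdefault(item, []).append(i)
    return b2j

def prune_common(b, b2j, common=1):
    # Option F: if 'common' is not None, junk items occurring in
    # more than common% of b's positions, but only when b is long
    # enough that duplicates cannot be junked.
    if common is None:
        return b2j
    if len(b) >= 200 / common:
        threshold = len(b) * common // 100   # computed only once
        b2j = {item: idxs for item, idxs in b2j.items()
               if len(idxs) <= threshold}
    return b2j
```

With common=1 this matches the generalized formula above; common=None
turns the heuristic off entirely.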
My thinking here is that a user either knows or can easily find out the
length of b and the size of the intended or actual alphabet of b (the
latter is len(set(b))). So the user can conditionally call
SequenceMatcher with 'common' set to None or an int as appropriate,
perhaps after some experimentation. So the threshold is the only tuning
parameter actually needed, and it also allows the heuristic to be turned
off.
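For illustration only, such a caller-side policy might look like the
following (the helper name and the cutoffs are invented for the
example, and the best values would come from experimentation):

```python
def choose_common(b):
    # Pick a value for the proposed 'common' parameter from what the
    # user can easily determine: len(b) and the alphabet size.
    if len(b) < 200:
        return None          # below the default cutoff; leave it off
    if len(set(b)) < 30:     # smallish alphabet: 1% junks too eagerly
        return 10            # an illustrative larger threshold
    return 1                 # keep the default behavior
```

Under the proposed API, a user would then write something like
SequenceMatcher(None, a, b, common=choose_common(b)).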
The reason I did not list this before is the problem with 2.7.1. F,
unlike option A, overtly changes the api, and some would object to that
for 2.7 even though it really is a bugfix. However, option F will not
break code, while the covert change of option A could break code. So
this may be the best option for a bad situation. It is a minimal change
that gives the user complete flexibility.
In other words, I see three options for 2.7.1+:
D. keep the current sometimes buggy behavior
A. covertly change the heuristic to mostly fix it but possibly break
some uses.
F. add a parameter, with a no-change default, that allows the user to
select a change if and when wanted.
Another advantage of F is that then 2.7 and 3.2 would get the same change.
--
Other changes that apply regardless of the heuristic/api change:
Update the code to use sets (newer than difflib) instead of dicts with
values set to 1.
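The pattern being replaced looks roughly like this (the junk predicate
here is just a stand-in):

```python
b = 'abcabcab'
isjunk = lambda ch: ch == 'a'    # stand-in junk predicate

# difflib predates sets, so it marks junk with value-1 dict entries:
junkdict = {}
for elt in b:
    if isjunk(elt):
        junkdict[elt] = 1
isbjunk = junkdict.__contains__

# A set says the same thing more directly:
junkset = {elt for elt in b if isjunk(elt)}
isbjunk = junkset.__contains__
```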
Directly expose the set of 'common' items as an additional attribute of
SequenceMatcher instances. Such instance attributes are currently
undocumented, so adding one can hardly be a problem. Add documentation
thereof. Being able to see the effect of the heuristic when it is not
turned off might help people decide whether or not to use it, or how to
tune the threshold for smallish alphabets where 1 is too small.
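A sketch of what inspecting such an attribute could look like; the
attribute name 'bpopular' is only a placeholder for whatever name
gets chosen:

```python
import difflib

# b is long enough (>= 200) for the heuristic to fire, and 'x'
# occupies far more than 1% of its positions.
b = 'x' * 199 + 'y'
sm = difflib.SequenceMatcher(None, 'xy', b)
# Hypothetical attribute holding the items junked as 'common':
print(sm.bpopular)
```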
--
Terry Jan Reedy
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev