Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

Terry Reedy Mon, 12 Jul 2010 19:29:57 -0700

On 7/11/2010 11:02 PM, Tim Peters wrote:

The heuristic lowered the reported match ratio from .96 to .88, which
would be bad when one wanted the unaltered value.

BTW, it's not clear whether ratio() computes a _useful_ value in the
presence of junk, however that may be defined.


I agree, which is one reason why one should be able to disable auto-junking.

There are a number of statistical methods for analyzing similarity
matrices, analogous to correlation matrices, except that entries are in
[0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such
analyses for sets of species. Similarity measures should usually be
symmetric and increase with greater matching. The heuristic can cause
both to fail.

I suspect nobody cares ;-)

There are multiple possible definitions of similarity for sets (and
arguments thereabout). I am sure the same is true for sequences. But I
consider the definition for .ratio, without the heuristic, to be
sensible. I would consider using it should the occasion arise.

It certainly was the intent that nothing would be
called junk unless it appeared at least twice, so the "n>= 200"
clause ensures that 1% of n is at least 2.

Since 2 cannot be greater than something that is at least 2, you ensured
that nothing would be called junk unless it appears at least thrice.
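To spell out the arithmetic, here is a minimal sketch. It assumes the test is a strict "count greater than 1% of n" comparison, which is what the exchange above implies. `min_junk_count` is a hypothetical helper, not difflib code, and the real source may differ in detail.

```python
def min_junk_count(n, percent=1):
    """Hypothetical helper: smallest integer count that is strictly
    greater than `percent` percent of n, i.e. count * 100 > n * percent."""
    return (n * percent) // 100 + 1


# At the n >= 200 boundary, 1% of n is exactly 2, and 2 > 2 is false,
# so under a strict comparison an item needs 3 occurrences.
boundary = min_junk_count(200)
print(boundary)
```

Under that assumption, a non-strict comparison (>=) would be needed to get the intended "at least twice" behaviour at n = 200.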

However, I'm wary of introducing a generalization in the absence of
experience saying people would use it.  Is this the right kind of
parametrization?  Is this even the right kind of way to go about
auto-detecting junk?  I know it worked great for the original use case
that drove it, but I haven't seen anyone say they want a notion of
auto-junk detection for other uses - just that they _don't_ want the
wholly inappropriate current auto-junk detection in (some or all) of
their uses.

IOW, it's hard to generalize confidently from a sample of one :-(

Implementation: Add a new parameter named 'common' or 'threshold' or
whatever that defaults to 1.


I'd call it "autojunk", cuz that's what it would do.  Also a useful
syntactic overlap with the name of the current "isjunk" argument.


I like that. I am now leaning toward the following:

G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add
'autojunk = True' to the constructor signature. This is the minimal
change that fixes the bug of no choice while keeping the default as is.
So it is a minimal violation of the usual stricture against API changes
in bugfix releases. I would doc this as "Use an internal heuristic that
identifies 'common' items as junk." and separately describe the 'current
heuristic', leaving open the possibility of changing it.
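As a rough sketch of what such a switch looks like to a caller: the example below assumes a difflib in which the constructor accepts the `autojunk` keyword (later releases do). The input strings are made up for illustration and chosen so that every item of the second sequence is 'common'.

```python
from difflib import SequenceMatcher

# Illustrative inputs (not from the issue): b has 200 items and both of
# its distinct items occur 100 times, so the heuristic treats both as
# 'common'. The leading "x" in a keeps the match from starting at index 0.
a = "x" + "ab" * 100
b = "ab" * 100

with_heuristic = SequenceMatcher(None, a, b, autojunk=True).ratio()
without_heuristic = SequenceMatcher(None, a, b, autojunk=False).ratio()

print(with_heuristic, without_heuristic)
```

With the default left at True, existing callers see no change; only those who pass autojunk=False get the unaltered ratio.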

Possible suboption: enforce 'autojunk in (True,False)' so the user
cannot forget that it is a switch and not a tuning parameter.
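A minimal sketch of that suboption, using a hypothetical `check_autojunk` helper (the name and the choice of ValueError are assumptions for illustration, not part of any proposal):

```python
def check_autojunk(autojunk):
    """Hypothetical validator: reject anything that is not a switch."""
    # Note: 1 == True and 0 == False in Python, so this membership test
    # also lets 1 and 0 through; it rejects values such as 2 or 0.5.
    if autojunk not in (True, False):
        raise ValueError("autojunk must be True or False, got %r" % (autojunk,))
    return bool(autojunk)


def rejects(value):
    """Return True if check_autojunk refuses the value."""
    try:
        check_autojunk(value)
    except ValueError:
        return True
    return False
```

A stricter variant could test isinstance(autojunk, bool) instead, at the cost of refusing 1 and 0.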

In 3.2, expose as an attribute a tuple 'heuristic' or '_heuristic' with
the tuning parameters for the current heuristic. Adding the _ would
indicate that it is a private, expert-only, use at your own risk,
subject to change attribute.

If we agree on this much, we can then discuss what the tuple should be
for 3.2.

Other changes that apply regardless of the heuristic/api change:

Update the code to use sets (newer than difflib) instead of dicts with
values set to 1.

Directly expose the set of 'common' items as an additional attribute of
SequenceMatcher instances. Such instance attributes are currently
undocumented, so adding one can hardly be a problem. Add documentation
thereof. Being able to see the effect of the heuristic when it is not turned
off might help people decide whether or not to use it, or how to tune the
threshold for smallish alphabets where 1 is too small.


Wholly agreed.  junkdict (after turning it into a set) should also be
exposed - when someone passes in a fancy regexp matcher for the isjunk
argument, they can be surprised at what their regexp matches.  Being
able to see the results can be helpful there too, for debugging.


I meant to include junkdict also.
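For illustration, here is how inspecting both sets can look. This sketch assumes a difflib that exposes them as the instance attributes `bjunk` and `bpopular`, the names documented from Python 3.2 on; this message itself only proposes exposing such sets. The input is made up.

```python
from difflib import SequenceMatcher

# Illustrative input: 201 items, with one trailing space that the
# user-supplied isjunk function flags.
b = "ab" * 100 + " "

sm = SequenceMatcher(isjunk=lambda ch: ch == " ", a="", b=b, autojunk=True)

# Items that isjunk matched, and items the heuristic called 'common'.
junk_items = set(sm.bjunk)
common_items = set(sm.bpopular)

print(sorted(junk_items), sorted(common_items))
```

Seeing the two sets side by side makes it easier to tell whether a surprising result comes from the isjunk function or from the heuristic.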


--
Terry Jan Reedy
