On 7/11/2010 11:02 PM, Tim Peters wrote:
The heuristic lowered the reported match ratio from .96 to .88, which
would be bad when one wanted the unaltered value.
BTW, it's not clear whether ratio() computes a _useful_ value in the
presence of junk, however that may be defined.
I agree, which is one reason why one should be to disable auto-junking.
There are a number of statistical methods for analyzing similarity
matrices, analogous to correlation matrices, except that entries are in
[0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such
analyses for sets of species. Similarity measures should ususally be
symmetric and increase with greater matching. The heuristic can cause
both to fail.
I suspect nobody cares ;-)
There are multiple possible definitions of similarity for sets (and
arguments thereabout). I am sure the same is true for sequences. But I
consider the definition for .ratio, without the heuristic, to be
sensible. I would consider using it should the occasion arise.
It certainly was the intent that nothing would be
called junk unless it appeared at least twice, so the "n>= 200"
clause ensures that 1% of n is at least 2.
Since 2 cannot be greater than something that is at least 2, you ensured
that nothing would be called junk unless it appears as least thrice.
However, I'm wary of introducing a generalization in the absence of
experience saying people would use it. Is this the right kind of
parametrization? Is this even the right kind of way to go about
auto-detecting junk? I know it worked great for the original use case
that drove it, but I haven't seen anyone say they want a notion of
auto-junk detection for other uses - just that they _don't_ want the
wholly inappropriate current auto-junk detection in (some or all) of
their uses.
IOW, it's hard to generalize confidently from a sample of one :-(
Implementation: Add a new parameter named 'common' or 'threshold' or
whatever that defaults to 1.
I'd call it "autojunk", cuz that's what it would do. Also a useful
syntactic overlap with the name of the current "isjunk" argument.
I like that. I am now leaning toward the following?
G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add
'autojunk = True' to the constructor signature. This is the minimal
change that fixes the bug of no choice while keeping the default as is.
So it is a minimal violation of the usual stricture against API changes
in bugfix releases. I would doc this as "Use an internal heuristic that
identifies 'common' items as junk." and separately describe the 'current
heuristic', leaving open the possibility of changing it.
Possible suboption: enforce 'autojunk in (True,False)' so the user
cannot forget that it is a switch and not a tuning parameter.
In 3.2, expose as an attribute a tuple 'hueristic' or '_heuristic' with
the tuning parameters for the current heuristic. Adding the _ would
indicate that is it a private, expert-only, use at your own risk,
subject to change attribute.
If we agree on this much, we can then discuss what the tuple should be
for 3.2.
Other changes that apply regardless of the heuristic/api change:
Update the code to use sets (newer than difflib) instead of dicts with
values set to 1.
Directly expose the set of 'common' items as an additional attribute of
SequenceMatcher instances. Such instance attributes are currently
undocumented, so adding one can hardly be a problem. Add documention
thereof. Being able to see the effect of the heuristic when it is not turned
off might help people decide whether or not to use it, or how to tune the
threshold for smallish alphabets where 1 is too small.
Wholly agreed. junkdict (after turning it into a set) should also be
exposed - when someone passes in a fancy regexp matcher for the isjunk
argument, they can be surprised at what their regexp matches. Being
able to see the results can be helpful there too, for debugging.
I meant to include junkdict also.
--
Terry Jan Reedy
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com