New submission from Giacomo <[email protected]>:
I propose a new method for difflib's SequenceMatcher, namely .ratio_min(self, m).
.ratio_min(self, m) is an extension of the existing method .ratio(self).
Like .ratio(self), .ratio_min(self, m) returns a measure of the two
sequences' similarity (a float in [0, 1]). Unlike .ratio(), it ignores
matching blocks whose length is less than a given threshold m, which is
the method's second parameter.
This helps avoid spuriously high similarity scores that come from many short,
incidental matches.
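To illustrate the kind of spurious score meant here, the short sketch below uses only the existing difflib API. The two input strings are made up for the example; they share nothing but three isolated characters, yet .ratio() still reports a similarity of 0.5.

```python
from difflib import SequenceMatcher

# Illustrative inputs (not from the original report): the strings share
# only the isolated single characters "a", "c" and "e".
s = SequenceMatcher(None, "abcdef", "axcxex")

# Sizes of the real matching blocks; the final block returned by
# get_matching_blocks() is a zero-length sentinel, so it is skipped.
sizes = [block.size for block in s.get_matching_blocks() if block.size]

print(sizes)      # [1, 1, 1]
print(s.ratio())  # 0.5, i.e. 2.0 * 3 / 12
```

With a threshold of m = 2, none of these one-character blocks would count, so the filtered score would drop to 0.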
# NEW FUNCTION:
def ratio_min(self, m):
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M_min is the number of matched elements, counting only matching
    blocks of length at least m, this is 2.0*M_min / T.

    Note that this is 1 if the sequences are identical (and at least
    m elements long), and 0 if they have no substring of length m or
    more in common.

    .ratio_min() is similar to .ratio().
    .ratio_min(1) is equivalent to .ratio().

    >>> s = SequenceMatcher(None, "abcd", "bcde")
    >>> s.ratio_min(1)
    0.75
    >>> s.ratio_min(2)
    0.75
    >>> s.ratio_min(3)
    0.75
    >>> s.ratio_min(4)
    0.0
    """
    # Sum the sizes of the matching blocks that are at least m long.
    matches = sum(triple[-1] for triple in self.get_matching_blocks()
                  if triple[-1] >= m)
    return _calculate_ratio(matches, len(self.a) + len(self.b))
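For anyone who wants to try the idea against an unmodified difflib, here is a minimal sketch of the same computation as a free function. The name ratio_min_standalone is hypothetical and not part of difflib; the division is written out inline, assuming the same convention as .ratio() that two empty sequences score 1.0.

```python
from difflib import SequenceMatcher


def ratio_min_standalone(matcher, m):
    """Hypothetical helper: like matcher.ratio(), but counting only
    matching blocks whose size is at least m."""
    matched = sum(block.size
                  for block in matcher.get_matching_blocks()
                  if block.size >= m)
    total = len(matcher.a) + len(matcher.b)
    # Assumption: treat two empty sequences as fully similar, as ratio() does.
    return 2.0 * matched / total if total else 1.0


s = SequenceMatcher(None, "abcd", "bcde")
print(ratio_min_standalone(s, 1))  # 0.75, same as s.ratio()
print(ratio_min_standalone(s, 3))  # 0.75, the block "bcd" has size 3
print(ratio_min_standalone(s, 4))  # 0.0, no block of size 4 or more
```

The helper takes an existing SequenceMatcher so that any junk settings passed to its constructor are respected.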
----------
components: Library (Lib)
messages: 408622
nosy: gibu
priority: normal
severity: normal
status: open
title: Add ratio_min() function to the difflib library
type: enhancement
versions: Python 3.10, Python 3.11
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue46086>
_______________________________________