New submission from Lewis Haley:
Consider the following snippet:
import difflib

first = u'location,location,location'
for second in (
    u'location.location.location',  # two periods (no commas)
    u'location.location,location',  # period after first
    u'location,location.location',  # period after second
    u'location,location,location',  # perfect match
):
    edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
    print("comparing %r vs. %r gives edit dist: %g" % (first, second,
                                                       edit_dist))
I would expect the second and third tests to give the same result, but in
reality:
comparing u'location,location,location' vs. u'location.location.location' gives
edit dist: 0.923077
comparing u'location,location,location' vs. u'location.location,location' gives
edit dist: 0.653846
comparing u'location,location,location' vs. u'location,location.location' gives
edit dist: 0.961538
comparing u'location,location,location' vs. u'location,location,location' gives
edit dist: 1
Python 3.4 gives the same results.
From experimenting, it seems that when the period comes after the first
"location", the longest match found is the first two "locations" from the
first string against the final two "locations" from the second string.
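The pairing can be checked directly with `find_longest_match`. This is a minimal sketch, not part of the original report; the variable names `a`, `b`, `matcher` and `m` are chosen here for illustration. It prints the longest block the matcher selects and the slice of each string it covers.

```python
import difflib

a = 'location,location,location'
b = 'location.location,location'  # period after the first "location"

matcher = difflib.SequenceMatcher(None, a, b)
# Search the full extent of both strings for the longest common block.
m = matcher.find_longest_match(0, len(a), 0, len(b))

print(m)
print(repr(a[m.a:m.a + m.size]))  # slice of the first string
print(repr(b[m.b:m.b + m.size]))  # slice of the second string
```

After choosing a block, SequenceMatcher only looks for further matches to the left and to the right of it in both strings, so a block that starts at the beginning of one string and ends at the end of the other leaves nothing else to pair up.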
In [31]: difflib.SequenceMatcher(None, u'location,location,location',
u'location.location,location').ratio()
Out[31]: 0.6538461538461539
In [32]: difflib.SequenceMatcher(None, u'location,location,location',
u'location.location,location').get_matching_blocks()
Out[32]: [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]
In [33]: difflib.SequenceMatcher(None, u'location,location,location',
u'location,location.location').ratio()
Out[33]: 0.9615384615384616
In [34]: difflib.SequenceMatcher(None, u'location,location,location',
u'location,location.location').get_matching_blocks()
Out[34]:
[Match(a=0, b=0, size=17),
Match(a=18, b=18, size=8),
Match(a=26, b=26, size=0)]
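The two ratios follow from the matching blocks via the documented formula 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes them by hand; `manual_ratio` is a hypothetical helper written for this illustration, not a difflib function.

```python
import difflib

def manual_ratio(a, b):
    """Return (matched_chars, 2.0 * matched_chars / total_length)."""
    sm = difflib.SequenceMatcher(None, a, b)
    # The final block is a zero-size sentinel, so it adds nothing to the sum.
    matched = sum(block.size for block in sm.get_matching_blocks())
    return matched, 2.0 * matched / (len(a) + len(b))

first = 'location,location,location'
for second in ('location.location,location',   # period after first
               'location,location.location'):  # period after second
    matched, value = manual_ratio(first, second)
    library = difflib.SequenceMatcher(None, first, second).ratio()
    print("%r: matched=%d manual=%g ratio()=%g"
          % (second, matched, value, library))
```

With a combined length of 52, 17 matched characters give 34/52 and 25 matched characters give 50/52, which are the two values reported above.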
Using `quick_ratio` instead of `ratio` gives (what I consider to be) the
correct result.
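For comparison, a small sketch that prints both measures side by side. `quick_ratio` is documented as an upper bound on `ratio`; it counts the characters the two strings have in common without regard to order, so the position of the period cannot affect it. The helper name `compare` is an assumption made for this example.

```python
import difflib

def compare(a, b):
    """Return (ratio, quick_ratio) for the pair of strings."""
    sm = difflib.SequenceMatcher(None, a, b)
    return sm.ratio(), sm.quick_ratio()

first = 'location,location,location'
for second in ('location.location.location',   # two periods (no commas)
               'location.location,location',   # period after first
               'location,location.location',   # period after second
               'location,location,location'):  # perfect match
    ratio, quick = compare(first, second)
    print("%r: ratio=%g quick_ratio=%g" % (second, ratio, quick))
```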
----------
components: Library (Lib)
files: test.py
messages: 252925
nosy: Lewis Haley
priority: normal
severity: normal
status: open
title: difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value
with repetitous strings
versions: Python 2.7, Python 3.4
Added file: http://bugs.python.org/file40767/test.py
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue25391>
_______________________________________