New submission from Jonathan <bugrepo...@lightpear.com>:

The following two strings are identical apart from the text "UNIQUESTRING",
which is at the start of `first` and at the end of `second`.
Running the code below gives the following output:


0.99830220713073
0.99830220713073
0.023769100169779286  # ratio

0.99830220713073
0.99830220713073
0.023769100169779286  # ratio

As you can see, the ratio is close to 0 (about 0.024). Remove UNIQUESTRING from
either string and it goes up to 0.98 (correct). Remove it from both and you get
1.0 (correct).


```
from difflib import SequenceMatcher

first = """
UNIQUESTRING
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type 
specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in 
the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker 
including versions of Lorem Ipsum
"""


second = """

Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type 
specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in 
the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker 
including versions of Lorem Ipsum  UNIQUESTRING
"""

# autojunk is left at its default (True) to reproduce the output above
sm = SequenceMatcher(None, first, second)
print(sm.real_quick_ratio())
print(sm.quick_ratio())
print(sm.ratio())

print()

# Same comparison with the arguments swapped
sm2 = SequenceMatcher(None, second, first)
print(sm2.real_quick_ratio())
print(sm2.quick_ratio())
print(sm2.ratio())

```

If I add `autojunk=False`, I get a correct-looking ratio (0.98...).
However, from my reading of the autojunk docs, UNIQUESTRING shouldn't be
triggering it. Furthermore, looking at the code, as far as I can see autojunk
has no effect here...
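
A reduced case may be easier to poke at than the Lorem Ipsum text. The sketch
below is not the original data: `BODY`, `MARKER` and the `ratio` helper are
made-up names, and the body is just two characters repeated so that the string
is longer than the 200-item threshold the autojunk docs mention. It compares
the same pair of strings with autojunk on and off, plus the no-marker case.

```python
from difflib import SequenceMatcher

# Hypothetical stand-ins for the strings in the report: a 300-character body
# built from two distinct characters, plus a short marker.
BODY = "ab" * 150
MARKER = "XYZ"

first = MARKER + BODY   # marker at the start
second = BODY + MARKER  # marker at the end


def ratio(a, b, autojunk):
    """Return SequenceMatcher.ratio() for a and b with the given autojunk flag."""
    return SequenceMatcher(None, a, b, autojunk=autojunk).ratio()


with_autojunk = ratio(first, second, autojunk=True)
without_autojunk = ratio(first, second, autojunk=False)
no_marker = ratio(BODY, BODY, autojunk=True)

print(with_autojunk)     # marker present, default heuristic
print(without_autojunk)  # marker present, heuristic disabled
print(no_marker)         # marker removed from both strings
```

With these stand-in strings the pattern is the same as in the report: roughly
0.0099 with autojunk on, roughly 0.99 with it off, and 1.0 once the marker is
removed from both strings.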

Autojunk considers these items to be "popular" in that string:
`{'n', 'p', 'a', 'h', 'e', 'u', 'I', 'r', 'k', 'g', 'y', 'm', 'c', 'd', 't', 
'l', 'o', 's', ' ', 'i'}`

If I remove UNIQUESTRING from `first`, this is the autojunk popular set:
`{'c', 'p', 'a', 'u', 'r', 'm', 'k', 'g', 'I', 'd', ' ', 'o', 'h', 't', 'e', 
'i', 'l', 's', 'y', 'n'}`

They're identical!

In both scenarios, `b2j` is also identical.
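
For anyone wanting to repeat this kind of inspection, the popular set and `b2j`
are available as attributes on the matcher, and `get_matching_blocks()` shows
what was actually matched. The sketch below uses made-up stand-in strings
(`BODY` and `MARKER` are hypothetical, not the Lorem Ipsum text), so whether
the original strings behave the same way is an assumption, not something this
demonstrates.

```python
from difflib import SequenceMatcher

# Hypothetical stand-ins: a long two-character body and a short marker.
BODY = "ab" * 150
MARKER = "XYZ"
first = MARKER + BODY
second = BODY + MARKER

sm = SequenceMatcher(None, first, second)  # autojunk defaults to True

popular = sorted(sm.bpopular)  # elements the heuristic treated as popular
indexed = sorted(sm.b2j)       # elements still indexed for finding matches
print(popular)
print(indexed)

# Keep only blocks with a non-zero size; the final block returned by
# get_matching_blocks() is a zero-length sentinel.
blocks = [m for m in sm.get_matching_blocks() if m.size]
matched_text = [first[m.a:m.a + m.size] for m in blocks]
print(matched_text)
```

In this reduced case the popular elements are `a` and `b`, only the marker
characters remain in `b2j`, and the single matched block is the marker itself.
That may hint at why moving a unique string from one end to the other matters
so much once the common characters are no longer indexed.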

I don't pretend to understand what the module is doing in any detail, but this 
certainly seems like a false positive/negative.

Python 3.8.10

----------
components: Library (Lib)
messages: 412673
nosy: jonathan-lp
priority: normal
severity: normal
status: open
title: SequenceMatcher & autojunk - false negative
type: behavior
versions: Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue46667>
_______________________________________