New submission from Jonathan <bugrepo...@lightpear.com>:
The following two strings are identical apart from the text "UNIQUESTRING", which is at the start of `first` and at the end of `second`. Running the code below gives the following output:

    0.99830220713073
    0.99830220713073
    0.023769100169779286  # ratio

    0.99830220713073
    0.99830220713073
    0.023769100169779286  # ratio

As you can see, the ratio is basically 0. Remove either of the UNIQUESTRING pieces and it goes up to 0.98 (correct). Remove both and you get 1.0 (correct).

```
from difflib import SequenceMatcher

first = """
UNIQUESTRING
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
"""

second = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
UNIQUESTRING
"""

# autojunk is left at its default (True) here; this is the configuration
# that produces the near-zero ratio shown above.
sm = SequenceMatcher(None, first, second)
print(sm.real_quick_ratio())
print(sm.quick_ratio())
print(sm.ratio())
print()

# Same comparison with the arguments swapped.
sm2 = SequenceMatcher(None, second, first)
print(sm2.real_quick_ratio())
print(sm2.quick_ratio())
print(sm2.ratio())
```

If I add `autojunk=False`, then I get a correct-looking ratio (0.98...). However, from my reading of the autojunk docs, UNIQUESTRING shouldn't be triggering it. Furthermore, looking in the code, as far as I can see autojunk is having no effect...

Autojunk considers these items to be "popular" in that string:
`{'n', 'p', 'a', 'h', 'e', 'u', 'I', 'r', 'k', 'g', 'y', 'm', 'c', 'd', 't', 'l', 'o', 's', ' ', 'i'}`

If I remove UNIQUESTRING from `first`, this is the autojunk popular set:
`{'c', 'p', 'a', 'u', 'r', 'm', 'k', 'g', 'I', 'd', ' ', 'o', 'h', 't', 'e', 'i', 'l', 's', 'y', 'n'}`

They're identical! In both scenarios, `b2j` is also identical.

I don't pretend to understand what the module is doing in any detail, but this certainly seems like a false positive/negative.

Python 3.8.10

----------
components: Library (Lib)
messages: 412673
nosy: jonathan-lp
priority: normal
severity: normal
status: open
title: SequenceMatcher & autojunk - false negative
type: behavior
versions: Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue46667>
_______________________________________
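
A minimal sketch for inspecting the behaviour reported above, not a diagnosis from the maintainers. It compares the same pair of strings with `autojunk` on and off and prints the documented `bpopular` set and the matching blocks from `get_matching_blocks()`. The helper name `probe`, the constants `BODY` and `MARKER`, and the exact whitespace around the marker are assumptions made for illustration, so the numbers may differ slightly from those in the report.

```python
from difflib import SequenceMatcher

# Body text copied from the report. The original whitespace is not
# recoverable, so placing the marker on its own line is an assumption.
BODY = (
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. "
    "Lorem Ipsum has been the industry's standard dummy text ever since the "
    "1500s, when an unknown printer took a galley of type and scrambled it to "
    "make a type specimen book. It has survived not only five centuries, but "
    "also the leap into electronic typesetting, remaining essentially "
    "unchanged. It was popularised in the 1960s with the release of Letraset "
    "sheets containing Lorem Ipsum passages, and more recently with desktop "
    "publishing software like Aldus PageMaker including versions of Lorem Ipsum"
)
MARKER = "UNIQUESTRING"
first = f"\n{MARKER}\n{BODY}\n"
second = f"\n{BODY}\n{MARKER}\n"


def probe(a, b, autojunk):
    """Return (ratio, popular elements of b, non-empty matching blocks)."""
    sm = SequenceMatcher(None, a, b, autojunk=autojunk)
    # get_matching_blocks() always ends with a zero-size sentinel; drop it.
    blocks = [m for m in sm.get_matching_blocks() if m.size]
    return sm.ratio(), sm.bpopular, blocks


# Run the comparison once with the heuristic enabled and once without.
results = {flag: probe(first, second, flag) for flag in (True, False)}

for flag, (ratio, popular, blocks) in results.items():
    largest = max(blocks, key=lambda m: m.size)
    snippet = first[largest.a:largest.a + largest.size][:30]
    print(f"autojunk={flag}: ratio={ratio:.4f}, blocks={len(blocks)}")
    print(f"  popular in second: {''.join(sorted(popular))!r}")
    print(f"  largest block starts with: {snippet!r}")
```

The `difflib` documentation says that items marked popular are treated as junk for the purpose of matching. Comparing which block each run reports as largest is one way to check whether the uncommon characters in UNIQUESTRING end up anchoring the alignment when the heuristic is enabled.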