[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

brian.gallagher Sun, 08 Mar 2020 05:58:12 -0700


brian.gallagher <oss.brn...@gmail.com> added the comment:


I agree that there is an appeal to leaving any normalization to the application 
and that trying to guess what people want is a rabbit hole -- I hadn't even 
considered what casing would mean in a general sense for Unicode.

I'm not entirely convinced that this should be pursued either, but I'll refine 
my proposal, provide a little context in which I thought it could be a problem 
and see what you guys think.

1. Some code is written that assumes get_close_matches() will match on a 
case-insensitive basis. Only a small bit of testing is done because the 
functionality is provided by the standard library, not the application code, so 
we throw a few examples like 'apple' and 'ape' at it and decide it is okay. 
Later on, we discover we have a bug because we actually need to match against 
'AppLE' too.

2. The extension I had in mind was to match on a case-insensitive basis for 
only the alphabetic characters. I don't know much about Unicode, but there are 
definitely gotchas lurking in my previous statement (titlecase vs. uppercase), 
so copying the behaviour of str.upper()/str.lower() would seem reasonable 
to me. The function would still match the same strings it does now, but would 
additionally ignore casing. We wouldn't be eliminating any existing 
matches. I guess this still has the potential to be a breaking change, since 
someone might indirectly be depending on the current case-sensitive behaviour.
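
For illustration, the scenario in point 1 can be reproduced with a short 
sketch. The candidate list below is made up for this example; only 
get_close_matches() and SequenceMatcher come from difflib.

```python
from difflib import SequenceMatcher, get_close_matches

# Illustrative candidate list; the words are borrowed from point 1.
candidates = ["AppLE", "ape", "peach"]

# With the default cutoff of 0.6, the mixed-case spelling is not returned.
matches = get_close_matches("apple", candidates)
print(matches)

# The similarity ratios hint at why: comparison is character by character,
# so 'a' != 'A', 'l' != 'L' and 'e' != 'E'.
mixed_case_ratio = SequenceMatcher(None, "apple", "AppLE").ratio()
same_case_ratio = SequenceMatcher(None, "apple", "apple").ratio()
print(mixed_case_ratio, same_case_ratio)
```

The mixed-case ratio falls below the default cutoff, which is why 'AppLE' is 
missed even though it "looks" identical to a human reader.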

For 1., not testing that your code can handle mixed-case comparisons in the way 
you're assuming it will is probably your own fault. On the other hand, I think 
it is reasonable to assume that get_close_matches() will match an 
uppercase/lowercase counterpart, since the function's intent is to provide 
intuitive matches that "look right" to a human.

Maybe this is more of a documentation issue than something that needs to be 
addressed in the code. If a caveat about the case sensitivity of the function 
is added to the documentation, then developers will be aware of the limitation 
and can provide whatever normalization they want in the application code.
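
As a rough sketch of what such application-side normalization might look like: 
the helper name close_matches_ignoring_case is hypothetical and not part of 
difflib, and the choice of str.casefold() over str.lower() is an assumption 
made here, not something settled in this issue.

```python
from difflib import get_close_matches


def close_matches_ignoring_case(word, possibilities, n=3, cutoff=0.6):
    """Hypothetical wrapper: compare case-folded strings, return originals."""
    # Map each case-folded form back to the original spellings behind it.
    originals = {}
    for candidate in possibilities:
        originals.setdefault(candidate.casefold(), []).append(candidate)

    # Run the standard matcher on the normalized strings only.
    folded_matches = get_close_matches(
        word.casefold(), list(originals), n=n, cutoff=cutoff
    )

    # Translate the normalized matches back to what the caller passed in.
    results = []
    for folded in folded_matches:
        results.extend(originals[folded])
    return results[:n]


result = close_matches_ignoring_case("apple", ["AppLE", "ape", "peach"])
print(result)
```

Keeping the mapping back to the original strings means callers still get the 
spellings they supplied, not the normalized forms.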

Let me know what you guys think.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39891>
_______________________________________