On 3/23/20 4:33 AM, Chris Angelico wrote:
> On Mon, Mar 23, 2020 at 7:06 PM Alex Hall <alex.moj...@gmail.com> wrote:
>> I think I'm missing something, why is case insensitivity a mess?
>>
> Because there are many characters that case fold in strange ways.
> "ıIiİ".casefold() == 'ıiii̇', which means that lowercase dotless ı
> doesn't casefold to the same thing as uppercase dotless I does. Some
> characters case fold to strings of different lengths, such as "ß",
> which casefolds to "ss". I haven't even tried what happens with
> combining characters vs. combined characters. And Unicode case folding
> is already a simplified version of reality; what actual humans expect
> can be even more complicated, such as (I think) German case folding
> rules being different for names and for book titles, and the way that
> umlauted letters are case folded.
>
> On the other hand, this might actually mean it's *better* to have a
> dedicated case-insensitive-cut-prefix operation. It would be difficult
> to define in easy terms, but basically it should be such that the
> returned string (if not identical to the original) is the longest
> suffix of the original string such that, if the returned string were
> appended to the prefix and the result case folded, it would be the
> same as the original string case folded. But there could be other
> definitions, just as complicated, and not necessarily more correct.
>
> In any case, this can (and in my opinion should) be deferred for
> later. Start with the simple one that doesn't care about all these
> complexities, and then expand from there as the need is found.

The issue is that cases in Unicode are difficult, and can be locale dependent (Unicode calls this "tailoring").
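For the record, the quirks Chris mentions are easy to reproduce, and the cut-prefix definition he sketches can be written in a few lines. `removeprefix_casefold` below is a hypothetical name of my own, not an existing or proposed API; it's only a rough illustration of the "longest suffix" definition from the quoted message:

```python
# Full case folding can change string lengths and is not locale-aware.
assert "ß".casefold() == "ss"          # one character folds to two
assert "İ".casefold() == "i\u0307"     # dotted capital I folds to i + combining dot
assert "ı".casefold() == "ı"           # dotless ı folds to itself, not to "i"
assert "ıIiİ".casefold() == "ıiii\u0307"

def removeprefix_casefold(s, prefix):
    """Return the longest suffix t of s such that
    (prefix + t).casefold() == s.casefold(); otherwise return s unchanged.
    (Hypothetical helper, not part of the stdlib.)"""
    target = prefix.casefold()
    if not s.casefold().startswith(target):
        return s
    # Find the shortest leading slice of s whose casefold equals the folded
    # prefix; the remainder of s is then the longest matching suffix.
    for i in range(len(s) + 1):
        if s[:i].casefold() == target:
            return s[i:]
    return s  # fold boundaries never lined up with a slice boundary of s

assert removeprefix_casefold("Straße lang", "STRASSE") == " lang"
assert removeprefix_casefold("hello", "world") == "hello"
```

Note that "Straße" matches the prefix "STRASSE" even though the strings have different lengths, which is exactly the behavior a naive `s[len(prefix):]` slice gets wrong.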
In the above example with the i-s, casefold would have needed to be told that we were dealing with the Turkish language (or maybe some other language with the same issue), but currently Python's casefold function doesn't support the needed tailoring (and I don't know if there is an exhaustive listing somewhere of the tailorings that would be needed).

Fully handling Unicode so as to meet every national expectation is VERY difficult. It doesn't surprise me that the Python standard library doesn't attempt to get it totally right, but settles for just dealing with the 'default' processing.

The biggest part of this mess is that Unicode had to accept some compromises in its definition (because the languages themselves present problems and inconsistencies), and when you hit a spot where a compromise goes against what you are trying to do at the moment, it gets difficult.

--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/Q2COUQV323JKW2FEANXXHCXEP3RWXV2P/
Code of Conduct: http://python.org/psf/codeofconduct/