On 3/23/20 4:33 AM, Chris Angelico wrote:
> On Mon, Mar 23, 2020 at 7:06 PM Alex Hall <alex.moj...@gmail.com> wrote:
>> I think I'm missing something, why is case insensitivity a mess?
>>
> Because there are many characters that case fold in strange ways.
> "ıIiİ".casefold() == 'ıiii̇', which means that lowercase dotless ı
> doesn't casefold to the same thing as uppercase dotless I does. Some
> characters case fold to strings of different lengths, such as "ß",
> which casefolds to "ss". I haven't even tried what happens with
> combining characters vs. combined characters. And Unicode case folding
> is already a simplified version of reality; what actual humans expect
> can be even more complicated, such as (I think) German case folding
> rules being different for names and for book titles, and the way that
> umlauted letters are case folded.
>
> On the other hand, this might actually mean it's *better* to have a
> dedicated case-insensitive-cut-prefix operation. It would be difficult
> to define in easy terms, but basically it should be such that the
> returned string (if not identical to the original) is the longest
> suffix of the original string such that, if the returned string were
> appended to the prefix and the result case folded, it would be the
> same as the original string case folded. But there could be other
> definitions, just as complicated, and not necessarily more correct.
>
> In any case, this can (and in my opinion should) be deferred for
> later. Start with the simple one that doesn't care about all these
> complexities, and then expand from there as the need is found.

The issue is that cases in Unicode are difficult, and can be locale dependent (Unicode calls this "tailoring").
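For the record, the quirks Chris mentions are easy to reproduce, and the cut-prefix definition he sketches can be written in a few lines. `removeprefix_casefold` below is a hypothetical name of my own, not an existing or proposed API; it's only a rough illustration of the "longest suffix" definition from the quoted message:

```python
# Full case folding can change string lengths and is not locale-aware.
assert "ß".casefold() == "ss"          # one character folds to two
assert "İ".casefold() == "i\u0307"     # dotted capital I folds to i + combining dot
assert "ı".casefold() == "ı"           # dotless ı folds to itself, not to "i"
assert "ıIiİ".casefold() == "ıiii\u0307"

def removeprefix_casefold(s, prefix):
    """Return the longest suffix t of s such that
    (prefix + t).casefold() == s.casefold(); otherwise return s unchanged.
    (Hypothetical helper, not part of the stdlib.)"""
    target = prefix.casefold()
    if not s.casefold().startswith(target):
        return s
    # Find the shortest leading slice of s whose casefold equals the folded
    # prefix; the remainder of s is then the longest matching suffix.
    for i in range(len(s) + 1):
        if s[:i].casefold() == target:
            return s[i:]
    return s  # fold boundaries never lined up with a slice boundary of s

assert removeprefix_casefold("Straße lang", "STRASSE") == " lang"
assert removeprefix_casefold("hello", "world") == "hello"
```

Note that "Straße" matches the prefix "STRASSE" even though the strings have different lengths, which is exactly the behavior a naive `s[len(prefix):]` slice gets wrong.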
In the above example with the i-s, casefold would have needed to be told that we were dealing with the Turkish language (or maybe some other language with the same issue), but currently Python's casefold function doesn't support the needed tailoring (and I don't know if there is an exhaustive listing somewhere of the tailorings that would be needed).

Fully handling Unicode so as to meet every national expectation is VERY difficult. It doesn't surprise me that the Python standard library doesn't attempt to get it totally right, but settles for just dealing with the 'default' processing.

The biggest part of this mess is that Unicode had to accept some compromises in its definition (because the languages themselves present problems and inconsistencies), and when you hit a spot where a compromise goes against what you are trying to do at the moment, it gets difficult.

--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/Q2COUQV323JKW2FEANXXHCXEP3RWXV2P/
Code of Conduct: http://python.org/psf/codeofconduct/