[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

Tom Christiansen Mon, 03 Oct 2011 11:57:31 -0700

Tom Christiansen <[email protected]> added the comment:

Ezio Melotti <[email protected]> wrote
   on Mon, 03 Oct 2011 04:15:51 -0000:


>> But it still has to happen at compile time, of course, so I don't know
>> what you could do in Python.  Is there any way to change how the compiler
>> behaves even vaguely along these lines?

> I think things like "from __future__ import ..." do something similar,
> but I'm not sure it will work in this case (also because you will have
> to provide the list of aliases somehow).

Ah yes, that's right.  Hm.  I bet then it *would* be possible, just perhaps
a bit of a run-around to get there.  Not a high priority, but interesting.

> less readable than:
> 
> def my_capitalize(s):
>    return s[0].upper() + s[1:].lower()

> You could argue that the first is much more explicit and in a way
> clearer, but overall I think you agree with me that is less readable.

Certainly.

It's a bit like the way bug rate per lines of code is invariant across
programming languages.  When you have more opcodes, it gets harder to
understand because there are more interactions and things to remember.

>> That really isn't right.  A cased character is one with the Unicode "Cased"
>> property, and a lowercase character is one with the Unicode "Lowercase"
>> property.  The General Category is actually immaterial here.

> You might want to take a look and possibly add a comment on #12204 about this.

>> I've spent all bloody day trying to model Python's islower, isupper, and
>> istitle functions, but I get all kinds of errors, both in the definitions
>> and in the models of the definitions.

> If by "model" you mean "trying to figure out how they work", it's
> probably easier to look at the implementation (I assume you know
> enough C to understand what they do).  You can find the code for
> str.istitle() at
> http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358
> and the actual implementation of some macros like Py_UNICODE_ISTITLE at
> http://hg.python.org/cpython/file/default/Objects/unicodectype.c.

Thanks, that helps immensely.  I'm completely fluent in C.  I've gone 
and built a tags file of your whole v3.2 source tree to help me navigate.

The main underlying problem is that the internal macros are defined in a
way that made sense a long time ago, but no longer does now that (for
example) the Unicode Lowercase property has stopped being synonymous with
GC=Ll and also includes all code points with the Other_Lowercase
property.
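
To see the gap between GC=Ll and the Lowercase property concretely, here is
a small sketch.  It assumes a Python release newer than the 3.2 tree under
discussion, one in which str.islower() already follows the derived
Lowercase property; the helper name and the scan limit are made up for
illustration.

```python
import unicodedata

def lowercase_but_not_Ll(limit=0x3000):
    """Code points below `limit` that report islower() without being GC=Ll.

    Assumption: str.islower() follows the derived Lowercase property, which
    is not true of the 3.2 source tree discussed in this message.
    """
    return [
        cp
        for cp in range(limit)
        if chr(cp).islower() and unicodedata.category(chr(cp)) != "Ll"
    ]

if __name__ == "__main__":
    for cp in lowercase_but_not_Ll():
        ch = chr(cp)
        print("U+%04X %s %s" % (cp, unicodedata.category(ch),
                                unicodedata.name(ch, "<unnamed>")))
```

Every code point it lists would be missed by a test that only asks whether
the category is "Ll".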

The originating culprit is Tools/unicode/makeunicodedata.py.
It builds your tables only using UnicodeData.txt, which is
not enough.  For example:

    if category in ["Lm", "Lt", "Lu", "Ll", "Lo"]:
        flags |= ALPHA_MASK
    if category == "Ll":
        flags |= LOWER_MASK
    if 'Line_Break' in properties or bidirectional == "B":
        flags |= LINEBREAK_MASK
        linebreaks.append(char)
    if category == "Zs" or bidirectional in ("WS", "B", "S"):
        flags |= SPACE_MASK
        spaces.append(char)
    if category == "Lt":
        flags |= TITLE_MASK
    if category == "Lu":
        flags |= UPPER_MASK

It needs to use DerivedCoreProperties.txt to figure out whether
something is Other_Uppercase, Other_Lowercase, etc. In particular:

    Alphabetic := Lu+Ll+Lt+Lm+Lo + Nl + Other_Alphabetic
    Lowercase  := Ll + Other_Lowercase
    Uppercase  := Lu + Other_Uppercase

This affects a lot of things, but you should be able to just fix it
in Tools/unicode/makeunicodedata.py and have all of them start
working correctly.
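
As a rough sketch of what that fix might look like, the following parses
lines in the DerivedCoreProperties.txt format and sets the case flags from
the derived properties instead of from the General Category.  The function
names, the abbreviated sample data, and the mask values are illustrative
assumptions, not the actual contents of makeunicodedata.py.

```python
# Abbreviated sample in the DerivedCoreProperties.txt line format
# (the real file is far longer; this excerpt is for illustration only).
SAMPLE = """\
0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
2170..217F    ; Lowercase # Nl  [16] SMALL ROMAN NUMERAL ONE..SMALL ROMAN NUMERAL ONE THOUSAND
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
2160..216F    ; Uppercase # Nl  [16] ROMAN NUMERAL ONE..ROMAN NUMERAL ONE THOUSAND
"""

# Illustrative mask values; treat them as assumptions.
LOWER_MASK = 0x08
UPPER_MASK = 0x80

def parse_properties(text):
    """Map each code point to the set of property names listed for it."""
    props = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        code_range, name = [field.strip() for field in line.split(";")]
        first, _, last = code_range.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            props.setdefault(cp, set()).add(name)
    return props

def case_flags(cp, props):
    """Derive case flags from properties rather than from GC=Ll/Lu."""
    flags = 0
    names = props.get(cp, ())
    if "Lowercase" in names:
        flags |= LOWER_MASK
    if "Uppercase" in names:
        flags |= UPPER_MASK
    return flags

if __name__ == "__main__":
    table = parse_properties(SAMPLE)
    print(hex(case_flags(0x2170, table)))  # SMALL ROMAN NUMERAL ONE, GC=Nl
```

With this approach a code point such as U+2170, whose category is Nl, still
gets the lowercase flag because the derived property says so.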

You will probably also want to add 

    int _PyUnicode_IsWord(Py_UCS4 ch)

that uses the UTS#18 Annex C definition, so that you catch marks, too.
That definition is:

    Word := Alphabetic + Mc+Me+Mn + Nd + Pc

where Alphabetic is defined above to include Nl and Other_Alphabetic.
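
A minimal Python sketch of that Word test follows.  The unicodedata module
does not expose Other_Alphabetic, so this version approximates Alphabetic
with the L* and Nl categories only; that approximation and both function
names are assumptions made for illustration.

```python
import unicodedata

LETTER_CATEGORIES = ("Lu", "Ll", "Lt", "Lm", "Lo")
MARK_CATEGORIES = ("Mc", "Me", "Mn")

def is_alphabetic_approx(ch):
    """Approximate Alphabetic as L* + Nl.

    Assumption: Other_Alphabetic is left out because unicodedata does not
    provide it, so this under-reports compared with the real property.
    """
    return unicodedata.category(ch) in LETTER_CATEGORIES + ("Nl",)

def is_word(ch):
    """Word := Alphabetic + Mc+Me+Mn + Nd + Pc, with Alphabetic approximated."""
    cat = unicodedata.category(ch)
    return (
        is_alphabetic_approx(ch)
        or cat in MARK_CATEGORIES
        or cat == "Nd"
        or cat == "Pc"
    )

if __name__ == "__main__":
    for ch in ("a", "_", "\u0301", "7", "\u2170", " ", "-"):
        print("U+%04X" % ord(ch), is_word(ch))
```

The point of including the mark categories is that a combining accent such
as U+0301 stays part of the word it decorates.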

Somewhat related is stuff like this:

    typedef struct {
        const Py_UCS4 upper;
        const Py_UCS4 lower;
        const Py_UCS4 title;
        const unsigned char decimal;
        const unsigned char digit;
        const unsigned short flags;
    } _PyUnicode_TypeRecord;

There are two different bugs here.  First, you are missing 

        const Py_UCS4 fold;

which holds the case-folding data (published in CaseFolding.txt rather than
UnicodeData.txt) and is critical for doing case-insensitive matches correctly.

Second, there's also the problem that Py_UCS4 is an int.  That means you
are stuck with just the character-based simple versions of upper-, title-,
lower-, and foldcase.  You need to have fields for the full mappings, which
are now strings (well, int arrays) not single ints.  I'll use ??? for the
int-array type that I don't know:

        const ??? upper_full;
        const ??? lower_full;
        const ??? title_full;
        const ??? fold_full;

You will also need to extend the API from just

    Py_UCS4 _PyUnicode_ToUppercase(Py_UCS4 ch)

to something like

    ??? _PyUnicode_ToUppercase_Full(Py_UCS4 ch)

I don't know what the ??? return type is there, but it's whatever the
upper_full field in _PyUnicode_TypeRecord would be.
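
To show why a single Py_UCS4 cannot carry the result, here is a short
Python sketch.  It assumes a Python release in which str.upper() and
str.casefold() apply the full mappings, which is not the 3.2 tree discussed
here; the helper names are made up for illustration.

```python
def to_upper_full(ch):
    """Return the full uppercase mapping of one character as code points.

    Assumption: str.upper() applies the full (multi-character) mappings.
    """
    return [ord(c) for c in ch.upper()]

def equal_ignoring_case(a, b):
    """Case-insensitive comparison built on full case folding."""
    return a.casefold() == b.casefold()

if __name__ == "__main__":
    # One code point in, several code points out.
    print(["U+%04X" % cp for cp in to_upper_full("\u00df")])  # LATIN SMALL LETTER SHARP S
    print(["U+%04X" % cp for cp in to_upper_full("\ufb03")])  # LATIN SMALL LIGATURE FFI
    print(equal_ignoring_case("Stra\u00dfe", "STRASSE"))
```

The list-of-code-points return value plays the role of the ??? type above:
the mapping of one input code point can be longer than one code point.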

I know that Matthew Barnett has had to cover a bunch of these for his regex
module, including generating his own tables.  It might be possible to
piggy-back on that effort; certainly it would be desirable to try.

>> I really don't understand any of these functions.  I'm very sad.  I think
>> they are wrong, but maybe I am.  It is extremely confusing.

>> Shall I file a separate bug report?

> If after reading the code and/or the documentation you still think
> they are broken and/or that they can be improved, then you can open
> another issue.

I hadn't actually *looked* at capitalize yet, because I stumbled over
these errors in the way-underlying code that necessarily supports it.
The errors in definitions explain a lot of what I was 

Ok, more bugs.  Consider this:

    static 
    int fixcapitalize(PyUnicodeObject *self)
    {
        Py_ssize_t len = self->length;
        Py_UNICODE *s = self->str;
        int status = 0;

        if (len == 0)
            return 0;
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOUPPER(*s);
            status = 1;
        }
        s++;
        while (--len > 0) {
            if (Py_UNICODE_ISUPPER(*s)) {
                *s = Py_UNICODE_TOLOWER(*s);
                status = 1;
            }
            s++;
        }
        return status;
    }

There are several bugs there.  First, you have to use the TITLECASE if there
is one, and only use the uppercase if there is no titlecase.  Uppercase
is wrong.

Second, you cannot decide to do the case change only if it starts out as a
certain case.  You have to do it unconditionally, especially since your
tests for whether something is upper or lower are wrong.  For example,
Roman numerals, the iota subscript, the circled letters, and a few other
things all are case-changing but are not themselves Letters in the
GC=Ll/Lu/Lt sense.  There are also cased letters in the GC=Lm
category, which you miss.  Unicode has properties like Cased that you
should be using to determine whether something is cased.  It also has
properties like Changes_When_Uppercased (aka CWU) that tell you whether
something will change.  For example, most of the small capitals are cased
code points that are considered lowercase and which do not change when
uppercased.  However, LATIN LETTER SMALL CAPITAL R (which is a lowercase
code point) actually does have an uppercase mapping.  Strange but true.
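
Both points can be sketched in a few lines of Python: titlecase the first
character and lowercase the rest, without first asking what case anything
is in.  The function name is hypothetical, and the sketch assumes a Python
release whose str.title() and str.lower() apply the Unicode mappings
unconditionally, which is not the 3.2 code quoted above.

```python
def capitalize_unconditionally(s):
    """Titlecase the first character and lowercase the remainder.

    No islower()/isupper() test guards either step.  Calling str.title()
    on a one-character string applies that character's titlecase mapping.
    """
    return s[:1].title() + s[1:].lower()

if __name__ == "__main__":
    samples = [
        "hELLO",
        "\u01c6ABC",  # LATIN SMALL LETTER DZ WITH CARON has a distinct titlecase form
        "\u2170X",    # SMALL ROMAN NUMERAL ONE is GC=Nl yet changes case
    ]
    for s in samples:
        result = capitalize_unconditionally(s)
        print([hex(ord(c)) for c in s], "->", [hex(ord(c)) for c in result])
```

For the first character of the second sample, an uppercase-based version
would produce U+01C4 instead of the titlecase form U+01C5.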

Does this help at all?  I have to go to a meeting now.

--tom

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12753>
_______________________________________