Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote on Mon, 03 Oct 2011 04:15:51 -0000:

>> But it still has to happen at compile time, of course, so I don't know
>> what you could do in Python.  Is there any way to change how the compiler
>> behaves even vaguely along these lines?

> I think things like "from __future__ import ..." do something similar,
> but I'm not sure it will work in this case (also because you will have
> to provide the list of aliases somehow).

Ah yes, that's right.  Hm.  I bet then it *would* be possible, just perhaps
a bit of a run-around to get there.  Not a high priority, but interesting.

> less readable than:
>
>     def my_capitalize(s):
>         return s[0].upper() + s[1:].lower()
>
> You could argue that the first is much more explicit and in a way
> clearer, but overall I think you agree with me that is less readable.

Certainly.  It's a bit like the way bug rate per lines of code is invariant
across programming languages.  When you have more opcodes, it gets harder to
understand because there are more interactions and things to remember.

>> That really isn't right.  A cased character is one with the Unicode "Cased"
>> property, and a lowercase character is one with the Unicode "Lowercase"
>> property.  The General Category is actually immaterial here.

> You might want to take a look and possibly add a comment on #12204 about this.

>> I've spent all bloody day trying to model Python's islower, isupper, and
>> istitle functions, but I get all kinds of errors, both in the definitions
>> and in the models of the definitions.

> If by "model" you mean "trying to figure out how they work", it's
> probably easier to look at the implementation (I assume you know
> enough C to understand what they do).  You can find the code for
> str.istitle() at
> http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358
> and the actual implementation of some macros like Py_UNICODE_ISTITLE at
> http://hg.python.org/cpython/file/default/Objects/unicodectype.c.

Thanks, that helps immensely.  I'm completely fluent in C.
I've gone and built a tags file of your whole v3.2 source tree to help me
navigate.

The main underlying problem is that the internal macros are defined in a way
that made sense a long time ago, but no longer does, ever since (for example)
the Unicode Lowercase property stopped being synonymous with GC=Ll and started
also including all code points with the Other_Lowercase property.

The originating culprit is Tools/unicode/makeunicodedata.py.  It builds your
tables using only UnicodeData.txt, which is not enough.  For example:

    if category in ["Lm", "Lt", "Lu", "Ll", "Lo"]:
        flags |= ALPHA_MASK
    if category == "Ll":
        flags |= LOWER_MASK
    if 'Line_Break' in properties or bidirectional == "B":
        flags |= LINEBREAK_MASK
        linebreaks.append(char)
    if category == "Zs" or bidirectional in ("WS", "B", "S"):
        flags |= SPACE_MASK
        spaces.append(char)
    if category == "Lt":
        flags |= TITLE_MASK
    if category == "Lu":
        flags |= UPPER_MASK

It needs to use DerivedCoreProperties.txt to figure out whether something is
Other_Uppercase, Other_Lowercase, etc.  In particular:

    Alphabetic := Lu+Ll+Lt+Lm+Lo + Nl + Other_Alphabetic
    Lowercase  := Ll + Other_Lowercase
    Uppercase  := Lu + Other_Uppercase

This affects a lot of things, but you should be able to just fix it in
Tools/unicode/makeunicodedata.py and have all of them start working correctly.

You will probably also want to add

    Py_UCS4 _PyUnicode_IsWord(Py_UCS4 ch)

that uses the UTS#18 Annex C definition, so that you catch marks, too.  That
definition is:

    Word := Alphabetic + Mc+Me+Mn + Nd + Pc

where Alphabetic is defined above to include Nl and Other_Alphabetic.

Somewhat related is stuff like this:

    typedef struct {
        const Py_UCS4 upper;
        const Py_UCS4 lower;
        const Py_UCS4 title;
        const unsigned char decimal;
        const unsigned char digit;
        const unsigned short flags;
    } _PyUnicode_TypeRecord;

There are two different bugs here.

First, you are missing

    const Py_UCS4 fold;

which is another field from UnicodeData.txt, one that is critical for doing
case-insensitive matches correctly.

Second, there's also the problem that Py_UCS4 is an int.  That means you are
stuck with just the character-based simple versions of upper-, title-, lower-,
and foldcase.  You need to have fields for the full mappings, which are now
strings (well, int arrays), not single ints.  I'll use ??? for the int-array
type that I don't know:

    const ??? upper_full;
    const ??? lower_full;
    const ??? title_full;
    const ??? fold_full;

You will also need to extend the API from just

    Py_UCS4 _PyUnicode_ToUppercase(Py_UCS4 ch)

to something like

    ??? _PyUnicode_ToUppercase_Full(Py_UCS4 ch)

I don't know what the ??? return type is there, but it's whatever the
upper_full field in _PyUnicode_TypeRecord would be.

I know that Matthew Barnett has had to cover a bunch of these for his regex
module, including generating his own tables.  It might be possible to
piggy-back on that effort; certainly it would be desirable to try.

> I really don't understand any of these functions.  I'm very sad.  I think
> they are wrong, but maybe I am.  It is extremely confusing.

>> Shall I file a separate bug report?

> If after reading the code and/or the documentation you still think
> they are broken and/or that they can be improved, then you can open
> another issue.

I hadn't actually *looked* at capitalize yet, because I stumbled over these
errors in the way-underlying code that necessarily supports it.  The errors
in definitions explain a lot of what I was

Ok, more bugs.  Consider this:

    static int
    fixcapitalize(PyUnicodeObject *self)
    {
        Py_ssize_t len = self->length;
        Py_UNICODE *s = self->str;
        int status = 0;

        if (len == 0)
            return 0;
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOUPPER(*s);
            status = 1;
        }
        s++;
        while (--len > 0) {
            if (Py_UNICODE_ISUPPER(*s)) {
                *s = Py_UNICODE_TOLOWER(*s);
                status = 1;
            }
            s++;
        }
        return status;
    }

There are several bugs there.

First, you have to use the TITLECASE if there is one, and only use the
uppercase if there is no titlecase.  Uppercase is wrong.

Second, you cannot decide to do the case change only if it starts out as a
certain case.  You have to do it unconditionally, especially since your tests
for whether something is upper or lower are wrong.  For example, Roman
numerals, the iota subscript, the circled letters, and a few other things all
are case-changing but are not themselves Letters in the GC=Ll/Lu/Lt sense.
There are also cased letters in the GC=Lm category, which you miss.

Unicode has properties like Cased that you should be using to determine
whether something is cased.  It also has properties like
Changes_When_Uppercased (aka CWU) that tell you whether something will change.
For example, most of the small capitals are cased code points that are
considered lowercase and which do not change when uppercased.  However, the
LATIN SMALL CAPITAL R (which is a lowercase code point) actually does have an
uppercase mapping.  Strange but true.

Does this help at all?  I have to go to a meeting now.

--tom

----------
title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________
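
The casing points made in this message can be checked from Python itself.
What follows is only a minimal sketch, assuming Python 3.3 or later (not the
3.2 tree under discussion), where str.upper() and str.title() apply the full
Unicode mappings.  The helper names case_info and show, and the sample list,
are made up for illustration and are not part of CPython.

```python
import unicodedata


def case_info(ch):
    """Return (General Category, upper, title, lower) for a one-character string."""
    return unicodedata.category(ch), ch.upper(), ch.title(), ch.lower()


def show(ch):
    """Print the case mappings of one code point using ASCII-only escapes."""
    gc, up, ti, lo = case_info(ch)
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}: gc={gc} upper={up!a} title={ti!a} lower={lo!a}")


# Hypothetical sample set, chosen only to illustrate the message's examples.
SAMPLES = [
    "\u2170",  # a small Roman numeral
    "\u24d0",  # a circled small letter
    "\u0345",  # the combining iota subscript
    "\u0280",  # the small capital R
    "\u01c6",  # a digraph whose titlecase differs from its uppercase
    "\ufb01",  # a ligature whose full uppercase is two code points
]

if __name__ == "__main__":
    for sample in SAMPLES:
        show(sample)
```

The first three samples have General Category Nl, So, and Mn, yet each has an
uppercase mapping.  U+01C6 has a titlecase form distinct from its uppercase
form, and U+FB01 uppercases to two code points, which a single Py_UCS4 field
cannot hold.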