On 4/18/07, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> "Jeffrey Yasskin" <[EMAIL PROTECTED]> wrote:
> > I missed the beginning of this discussion, so sorry if you've already
> > covered this. Are you saying that in your app, just because I've set
> > the en_US locale, I won't be able to type "????"? Or that those
> > characters won't be recognized as letters?
>
> If I understand the conversation correctly, the discussion is what will
> be in string.letters, and what will be handled in str.upper(), etc.,
> when a locale is set.

string.letters should go away because I don't know of any correct uses
of it, and as you say 40K letters is too long. Searching a list is the
wrong way to decide whether a character is a letter, and case
transformations don't work one character at a time: consider
"ß".upper(), where "ß" is U+00DF, the German small sharp s, whose
uppercase form is the two-character "SS".
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt defines the
mappings that aren't one-to-one. Some of them are locale-specific, but
you can do a pretty good job ignoring the language, as long as you
allow strings to change length.

> > The Unicode character database (http://www.unicode.org/ucd/) seems
> > like the obvious way to handle character properties if you want to get
> > the right answers.
>
> Certainly, but having 40k characters in string.letters seems like a bit
> of overkill, for *any* locale. It seems as though it only makes sense
> to include the letters for the current locale as string.letters, and to
> handle str.upper(), etc., as determined by the locale.

As far as I understand, "letters for the current locale" is the same
as "letters" in Unicode. Can you point me to a character that is a
letter in one locale but not in another? (The third column of
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt defines the
character's category, and
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
says what it means.)

> In terms of sorting, since all (unicode) strings should be comparable to
> one another, using the unicode-specified ordering would seem to make
> sense, unless it is something other than code point values. If it isn't
> code point values (which seems to be the implication), then we need to
> decide if we want to check a 128kbyte table (for UCS-2 builds) in order
> to sort strings (though cache lookup locality may make this a moot point
> for most comparisons).

If you just need to store strings in an order-based data structure
(which I guess is moot for Python with its hashes), then code point
order is fine. If you intend to show users a sorted list, then you have
to use the real collation algorithm or you'll produce the wrong answer.
I don't understand the algorithm's details, but ICU has an
implementation, and http://icu-project.org/charts/icu4c_footprint.html
claims that the data for all languages fits in 354K.

UCS-2 is an old and broken fixed-width encoding that cannot represent
characters above U+FFFF. Nobody should ever use it. You probably meant
UTF-16.

-- 
Namasté,
Jeffrey Yasskin
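
As a rough illustration of the points in this message, here is a minimal Python sketch. It assumes a Python 3 interpreter whose str.upper() applies the full Unicode case mappings, and the helper name is_letter is made up for this example. It decides letter-ness from the General_Category property instead of searching a list, shows a case conversion that changes the string's length, and shows that code point order is not the order a reader would expect. It does not attempt real collation.

```python
import unicodedata


def is_letter(ch):
    """Return True if ch is in a Unicode letter category (Lu, Ll, Lt, Lm, Lo).

    is_letter is a hypothetical helper for this sketch, not an existing API.
    """
    return unicodedata.category(ch).startswith("L")


# U+00DF LATIN SMALL LETTER SHARP S: one character in, two characters out.
sharp_s = "\u00df"
upper_sharp_s = sharp_s.upper()

# Code point order puts every ASCII capital before every ASCII lowercase
# letter, and every non-ASCII letter after both.
words = ["zebra", "Z\u00fcrich", "apple", "\u00c4pfel"]
by_code_point = sorted(words)

if __name__ == "__main__":
    print(is_letter(sharp_s), is_letter("1"))
    print(repr(upper_sharp_s), len(upper_sharp_s))
    print(by_code_point)
```

For a list that users will read, the plain sorted() call would have to be replaced by a real collation implementation such as ICU.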