"Jeffrey Yasskin" <[EMAIL PROTECTED]> wrote: > On 4/18/07, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > "Jeffrey Yasskin" <[EMAIL PROTECTED]> wrote: > > > I missed the beginning of this discussion, so sorry if you've already > > > covered this. Are you saying that in your app, just because I've set > > > the en_US locale, I won't be able to type "????"? Or that those > > > characters won't be recognized as letters? > > > > If I understand the conversation correctly, the discussion is what will > > be in string.letters, and what will be handled in str.upper(), etc., > > when a locale is set. > > string.letters should go away because I don't know of any correct uses > of it, and as you say 40K letters is too long. Searching a list is the > wrong way to decide whether a character is a letter, and case > transformations don't work a character at a time (consider what > happens with "Ã".upper() (That is, U+00DF, German Small Sharp S)). > http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt defines the > mappings that aren't 1-1. There are some that are locale-specific, but > you can do a pretty good job ignoring the language, as long as you > allow strings to change length.

Because we aren't mutating unicode strings, this isn't an issue. I
respond below regarding string.letters.

> > > The Unicode character database (http://www.unicode.org/ucd/) seems
> > > like the obvious way to handle character properties if you want to get
> > > the right answers.
> >
> > Certainly, but having 40k characters in string.letters seems like a bit
> > of overkill, for *any* locale. It seems as though it only makes sense
> > to include the letters for the current locale as string.letters, and to
> > handle str.upper(), etc., as determined by the locale.
>
> As far as I understand, "letters for the current locale" is the same
> as "letters" in Unicode. Can you point me to a character that is a
> letter in one locale but not in another? (The third column of
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt defines the
> character's category, and
> http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
> says what it means.)

Neither I nor, I believe, Python means 'letters' in the general sense,
but rather the 'alphabet' of a particular locale: for example, the
alphabet of en_US compared to that of sv_SE.

> > In terms of sorting, since all (unicode) strings should be comparable to
> > one another, using the unicode-specified ordering would seem to make
> > sense, unless it is something other than code point values. If it isn't
> > code point values (which seems to be the implication), then we need to
> > decide if we want to check a 128kbyte table (for UCS-2 builds) in order
> > to sort strings (though cache lookup locality may make this a moot point
> > for most comparisons).
>
> If you just need to store strings in an order-based data structure
> (which I guess is moot for python with its hashes), then codepoint
> order is fine. If you intend to show users a sorted list, then you
> have to use the real collation algorithm or you'll produce the wrong
> answer.
> I don't understand the algorithm's details, but ICU has an
> implementation, and http://icu-project.org/charts/icu4c_footprint.html
> claims that the data for all languages fits in 354K.

It could probably be reduced below 354K with two tables and a
comparison function that knows how to handle surrogates.

> UCS-2 is an old and broken fixed-width encoding that cannot represent
> characters above U+FFFF. Nobody should ever use it. You probably meant
> UTF-16.

You are more or less right. Earlier versions of Windows were limited to
UCS-2, and I believe earlier versions of Python on Windows were also
limited to UCS-2. For narrow builds we use UTF-16, with surrogate pairs
and everything (though a unicode string consisting of a single surrogate
pair will have length 2, not 1 as would be expected).

 - Josiah
_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
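
As an illustration of the surrogate-pair remark in the message above, here is a minimal sketch. It assumes Python 3.3 or later, where narrow builds no longer exist and len() counts code points; on a narrow build of the kind described in the message, len() of the same string would report 2. The helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(s):
    """Return the number of 16-bit code units needed to store s as UTF-16."""
    # Hypothetical helper. "utf-16-le" is used so that no byte-order
    # mark is included in the encoded bytes.
    return len(s.encode("utf-16-le")) // 2


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a character above U+FFFF

print(len(clef))               # length in code points
print(utf16_code_units(clef))  # length in UTF-16 code units
```

A character above U+FFFF needs two UTF-16 code units (one surrogate pair), which is the mismatch between stored length and expected length that the message describes.
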
