Tom Christiansen <tchr...@perl.com> added the comment:

> Martin v. Löwis <mar...@v.loewis.de> added the comment:

> "Split S into words. Change the first letter in a word to upper-case,

Except that I think you actually mean that the first "letter" is
changed into titlecase, not uppercase. One might also say *try* to
change for all these, in that not all cased code points in Unicode
have casemaps that are different from themselves. For example, a
superscript lowercase a or b has no distinct uppercase mapping, the
way the non-superscript versions do:

    % (echo xyz; echo ab AB | unisupers) | uc
    XYZ
    ᵃᵇ ᴬᴮ

> and all subsequent letters to lower case. A word is a sequence that
> starts with a letter, followed by letter-related characters."

I don't like the way you have defined letters and letter-related
characters. The first already has a definition, which is not the one
you are using. Word characters also have a definition in Unicode, and
it is not the one you are using.

I strongly advise against redefining standard Unicode properties.
Choose other, unused terms if you must. It is very confusing otherwise.

> Letters are all characters from the "Alphabetic" category, i.e.
> Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.

Except that is exactly the definition of the Unicode Alphabetic
property, not the Unicode Letter property. It is a mistake to equate
Letter=Alphabetic, and very confusing too.

I agree that this is probably what you want, though. I just don't
think you should use "letter-related characters" when there is an
existing formal definition that works, or that you should redefine
Letter.

> "letter-related" characters are letters + marks (Mn, Mc, Me).

That isn't quite right.

  * Letters are Lu+Ll+Lt+Lm+Lo.

  * Alphabetic is Letters + Other_Alphabetic.

  * Other_Alphabetic is certain marks (like the iota subscript) and
    the letter numbers (Nl), as well as a few symbols.

  * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

I think what you are looking for here is Word characters without
Nd + Pc, so just Alphabetic + Mn+Mc+Me. Is that right?
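For concreteness, here is a minimal Python sketch of the word rule discussed above: a word starts with an Alphabetic character and continues with Alphabetic characters or marks (Mn, Mc, Me). It is only an approximation. The standard unicodedata module does not expose the Alphabetic property, so the sketch assumes L* + Nl stands in for it, and the few Other_Alphabetic symbols are therefore missed. The names is_word_start, is_word_continue and titlecase_words are hypothetical and not part of any proposed patch.

```python
import unicodedata


def is_word_start(ch):
    """Approximate the Alphabetic property with general categories L* and Nl.

    Assumption: Other_Alphabetic symbols are not covered, because the
    stdlib has no Alphabetic lookup.
    """
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"


def is_word_continue(ch):
    """Alphabetic (approximated as above) plus marks: Mn, Mc, Me."""
    return is_word_start(ch) or unicodedata.category(ch).startswith("M")


def titlecase_words(s):
    """Titlecase the first character of each word, lowercase the rest."""
    out = []
    in_word = False
    for ch in s:
        if in_word and is_word_continue(ch):
            # Inside a word: marks pass through lower() unchanged.
            out.append(ch.lower())
        elif is_word_start(ch):
            # str.title() on one character applies its titlecase mapping.
            out.append(ch.title())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


if __name__ == "__main__":
    # "deme" with a combining acute accent after the first e
    print(titlecase_words("de\u0301me"))
```

Because a combining mark continues the current word instead of ending it, the letter after the accent stays lowercase. Casing is applied per code point here, so context-sensitive rules such as final sigma are not handled.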
--tom

PS: You can do union/intersection stuff with properties to see what
the resulting sets look like using the unichars command-line tool.

This is everything that is both alphabetic and also a mark:

    % unichars -gs '\p{Alphabetic}' '\pM'
    ○ͅ U+0345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
    ○ְ U+05B0 GC=Mn SC=Hebrew HEBREW POINT SHEVA
    ○ֱ U+05B1 GC=Mn SC=Hebrew HEBREW POINT HATAF SEGOL
    ○ֲ U+05B2 GC=Mn SC=Hebrew HEBREW POINT HATAF PATAH
    ○ֳ U+05B3 GC=Mn SC=Hebrew HEBREW POINT HATAF QAMATS
    ...
    ○ं U+0902 GC=Mn SC=Devanagari DEVANAGARI SIGN ANUSVARA
    ः U+0903 GC=Mc SC=Devanagari DEVANAGARI SIGN VISARGA
    ा U+093E GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN AA
    ि U+093F GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN I
    ी U+0940 GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN II
    ○ु U+0941 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN U
    ○ू U+0942 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN UU
    ○ृ U+0943 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC R
    ○ॄ U+0944 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC RR
    ...

While these are the NON-alphabetic marks, which are still Word
characters though of course:

    % unichars -gs '\P{Alphabetic}' '\pM'
    ○̀ U+0300 GC=Mn SC=Inherited COMBINING GRAVE ACCENT
    ○́ U+0301 GC=Mn SC=Inherited COMBINING ACUTE ACCENT
    ○̂ U+0302 GC=Mn SC=Inherited COMBINING CIRCUMFLEX ACCENT
    ○̃ U+0303 GC=Mn SC=Inherited COMBINING TILDE
    ○̄ U+0304 GC=Mn SC=Inherited COMBINING MACRON
    ○̅ U+0305 GC=Mn SC=Inherited COMBINING OVERLINE
    ○̆ U+0306 GC=Mn SC=Inherited COMBINING BREVE
    ○̇ U+0307 GC=Mn SC=Inherited COMBINING DOT ABOVE
    ○̈ U+0308 GC=Mn SC=Inherited COMBINING DIAERESIS
    ○̉ U+0309 GC=Mn SC=Inherited COMBINING HOOK ABOVE
    ○̊ U+030A GC=Mn SC=Inherited COMBINING RING ABOVE
    ○̋ U+030B GC=Mn SC=Inherited COMBINING DOUBLE ACUTE ACCENT
    ○̌ U+030C GC=Mn SC=Inherited COMBINING CARON
    ...
And here are the Cased code points that do not change when upper-,
title-, or lowercased:

    % unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
    ª U+00AA GC=Ll SC=Latin FEMININE ORDINAL INDICATOR
    º U+00BA GC=Ll SC=Latin MASCULINE ORDINAL INDICATOR
    ĸ U+0138 GC=Ll SC=Latin LATIN SMALL LETTER KRA
    ƍ U+018D GC=Ll SC=Latin LATIN SMALL LETTER TURNED DELTA
    ƛ U+019B GC=Ll SC=Latin LATIN SMALL LETTER LAMBDA WITH STROKE
    ƪ U+01AA GC=Ll SC=Latin LATIN LETTER REVERSED ESH LOOP
    ƫ U+01AB GC=Ll SC=Latin LATIN SMALL LETTER T WITH PALATAL HOOK
    ƺ U+01BA GC=Ll SC=Latin LATIN SMALL LETTER EZH WITH TAIL
    ƾ U+01BE GC=Ll SC=Latin LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
    ȡ U+0221 GC=Ll SC=Latin LATIN SMALL LETTER D WITH CURL
    ȴ U+0234 GC=Ll SC=Latin LATIN SMALL LETTER L WITH CURL
    ȵ U+0235 GC=Ll SC=Latin LATIN SMALL LETTER N WITH CURL
    ȶ U+0236 GC=Ll SC=Latin LATIN SMALL LETTER T WITH CURL
    ȷ U+0237 GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J
    ȸ U+0238 GC=Ll SC=Latin LATIN SMALL LETTER DB DIGRAPH
    ȹ U+0239 GC=Ll SC=Latin LATIN SMALL LETTER QP DIGRAPH
    ɕ U+0255 GC=Ll SC=Latin LATIN SMALL LETTER C WITH CURL
    ɘ U+0258 GC=Ll SC=Latin LATIN SMALL LETTER REVERSED E
    ɚ U+025A GC=Ll SC=Latin LATIN SMALL LETTER SCHWA WITH HOOK
    ɜ U+025C GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E
    ɝ U+025D GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
    ɞ U+025E GC=Ll SC=Latin LATIN SMALL LETTER CLOSED REVERSED OPEN E
    ɟ U+025F GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J WITH STROKE
    ɡ U+0261 GC=Ll SC=Latin LATIN SMALL LETTER SCRIPT G
    ɢ U+0262 GC=Ll SC=Latin LATIN LETTER SMALL CAPITAL G
    ɤ U+0264 GC=Ll SC=Latin LATIN SMALL LETTER RAMS HORN
    ɥ U+0265 GC=Ll SC=Latin LATIN SMALL LETTER TURNED H
    ɦ U+0266 GC=Ll SC=Latin LATIN SMALL LETTER H WITH HOOK
    ...

You can get unichars from

    http://training.perl.com/scripts/unichars

where you might also care to get uniprops and perhaps uninames to go
with it.
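For anyone without Perl at hand, the cased-but-unchanged query can be roughly approximated in Python. This is a sketch under stated assumptions, not an equivalent of the unichars command: it assumes that str.islower() and str.isupper() on a single character track the Unicode Lowercase and Uppercase properties, it treats Cased as Lowercase + Uppercase + Lt, and it compares each code point with its own case mappings rather than using the NFD-based Changes_When_* definitions. Results also depend on the Unicode version bundled with the interpreter. The function names are hypothetical.

```python
import unicodedata


def is_cased(ch):
    """Approximate Cased as Lowercase + Uppercase + titlecase letters (Lt)."""
    return ch.islower() or ch.isupper() or unicodedata.category(ch) == "Lt"


def unchanged_by_casing(ch):
    """True if upper-, title-, and lowercasing all leave ch as it is."""
    return ch.upper() == ch and ch.title() == ch and ch.lower() == ch


def cased_but_unchanged(limit=0x250):
    """Collect cased code points below `limit` that no case mapping changes."""
    found = []
    for cp in range(limit):
        ch = chr(cp)
        if is_cased(ch) and unchanged_by_casing(ch):
            found.append(ch)
    return found


if __name__ == "__main__":
    for ch in cased_but_unchanged():
        # unicodedata.name() takes a default for unnamed code points.
        print(f"{ch} U+{ord(ch):04X} GC={unicodedata.category(ch)} "
              f"{unicodedata.name(ch, '<unnamed>')}")
```

The default limit of 0x250 keeps the scan short; pass 0x110000 to scan every code point.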
There are other Unicode tools there (the directory is 100% Unicode
tools, not general scripts as its name suggests), but those are the
important ones, I reckon.

----------
title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12737>
_______________________________________