New submission from Christoph Burgmer <cburg...@ira.uka.de>:

Titlecase, i.e. istitle() and title(), is buggy when the string includes combining diacritical marks.

>>> u'H\u0301ngh'.istitle()
False
>>> u'H\u0301ngh'.title()
u'H\u0301Ngh'
>>>

The string given is already in titlecase, so the following result is expected:

>>> u'H\u0301ngh'.istitle()
True
>>> u'H\u0301ngh'.title()
u'H\u0301ngh'
>>>

UTR#21 Case Mappings defines the following algorithm for titlecase mapping [1]:

For each character C, find the preceding character B.
    ignore any intervening case-ignorable characters when finding B.
If B exists, and is cased
    map C to UCD_lower(C)
Otherwise,
    map C to UCD_title(C)

The class of 'case-ignorable' is defined under [2] and includes Nonspacing Marks (Mn) as listed in [3]. This covers diacritical marks, among others. These marks are currently handled like spaces, which splits a word at each mark; they should instead be skipped when looking for the preceding character.

A patch including the above test case is attached.

[1] http://unicode.org/reports/tr21/tr21-5.html#Case_Conversion_of_Strings
[2] http://unicode.org/reports/tr21/tr21-5.html#Definitions
[3] http://www.fileformat.info/info/unicode/category/Mn/list.htm

----------
components: Library (Lib)
files: test_unicode.titlecase.diff
keywords: patch
messages: 90086
nosy: christoph
severity: normal
status: open
title: Titlecase as defined in Unicode Case Mappings not followed
versions: Python 2.5, Python 2.6
Added file: http://bugs.python.org/file14443/test_unicode.titlecase.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6412>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
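
For illustration, the UTR#21 titlecase algorithm quoted in the report can be sketched in pure Python 3 roughly as follows. This is not the attached patch and not CPython's implementation. The names utr21_title and utr21_istitle are hypothetical, and the 'cased' and 'case-ignorable' classes are only approximated here by general category, which is an assumption made for brevity (the Unicode definitions cover more characters than this).

```python
import unicodedata

# Assumption: approximate the Unicode property classes by general category.
# The real definitions also include other characters (e.g. the apostrophe
# is case-ignorable, and some modifier letters are cased).
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def is_case_ignorable(ch):
    return unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES


def is_cased(ch):
    return unicodedata.category(ch) in CASED_CATEGORIES


def utr21_title(text):
    """Hypothetical helper: titlecase text following the quoted algorithm."""
    out = []
    prev_cased = False  # True if B exists and is cased
    for ch in text:
        # If B exists and is cased, map C to lower(C), otherwise to title(C).
        out.append(ch.lower() if prev_cased else ch.title())
        # Case-ignorable characters are skipped when finding B, so they
        # leave the remembered state untouched.
        if not is_case_ignorable(ch):
            prev_cased = is_cased(ch)
    return "".join(out)


def utr21_istitle(text):
    """Hypothetical helper: True if text has a cased character and is
    unchanged by utr21_title."""
    return any(is_cased(ch) for ch in text) and utr21_title(text) == text


print(utr21_title("h\u0301NGH") == "H\u0301ngh")
print(utr21_istitle("H\u0301ngh"))
```

Because U+0301 has category Mn, it does not reset the "preceding character is cased" state, so the 'n' that follows it is lowercased rather than treated as the start of a new word.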