New submission from Christoph Burgmer <cburg...@ira.uka.de>:

Titlecasing, i.e. istitle() and title(), gives wrong results when the
string includes combining diacritical marks.

>>> u'H\u0301ngh'.istitle()
False
>>> u'H\u0301ngh'.title()
u'H\u0301Ngh'
>>>

The given string is already in titlecase, so the following result is
expected:
>>> u'H\u0301ngh'.istitle()
True
>>> u'H\u0301ngh'.title()
u'H\u0301ngh'
>>>

UTR#21 Case Mappings defines the following algorithm for titlecase
mapping [1]:

For each character C, find the preceding character B. 
  ignore any intervening case-ignorable characters when finding B.
If B exists, and is cased 
  map C to UCD_lower(C)
Otherwise, 
  map C to UCD_title(C)

The class of 'case-ignorable' characters is defined under [2] and
includes Nonspacing Marks (Mn) as listed in [3]. This covers diacritical
marks, among others. Such characters should not be treated like spaces,
as they currently are, because that splits a word at the mark.
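
The algorithm quoted above could be sketched in pure Python roughly as
follows. This is only an illustration, not the attached patch and not
the interpreter's implementation. The names utr21_title,
is_case_ignorable and is_cased are hypothetical. Two simplifying
assumptions are made: 'case-ignorable' is reduced to the Mn category
mentioned above, and 'cased' is approximated by the general categories
Lu, Ll and Lt.

```python
import unicodedata


def is_case_ignorable(ch):
    # Assumption: only Nonspacing Marks (Mn) are treated as
    # case-ignorable here; the full definition in [2] is broader.
    return unicodedata.category(ch) == 'Mn'


def is_cased(ch):
    # Assumption: "cased" is approximated by upper-, lower- and
    # titlecase letters (Lu, Ll, Lt).
    return unicodedata.category(ch) in ('Lu', 'Ll', 'Lt')


def utr21_title(s):
    # Hypothetical helper following the quoted algorithm.
    out = []
    prev_cased = False  # is the preceding character B cased?
    for ch in s:
        # If B exists and is cased, map C to lowercase,
        # otherwise map C to titlecase.
        out.append(ch.lower() if prev_cased else ch.title())
        # Case-ignorable characters are skipped when looking for B,
        # so they leave prev_cased untouched.
        if not is_case_ignorable(ch):
            prev_cased = is_cased(ch)
    return ''.join(out)
```

With this sketch, utr21_title(u'H\u0301ngh') returns the input
unchanged, because the combining acute accent no longer starts a new
word, while a space still does.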

A patch including the above test case is attached.

[1]
http://unicode.org/reports/tr21/tr21-5.html#Case_Conversion_of_Strings
[2] http://unicode.org/reports/tr21/tr21-5.html#Definitions
[3] http://www.fileformat.info/info/unicode/category/Mn/list.htm

----------
components: Library (Lib)
files: test_unicode.titlecase.diff
keywords: patch
messages: 90086
nosy: christoph
severity: normal
status: open
title: Titlecase as defined in Unicode Case Mappings not followed
versions: Python 2.5, Python 2.6
Added file: http://bugs.python.org/file14443/test_unicode.titlecase.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6412>
_______________________________________