[issue5200] unicode.normalize gives wrong result for some characters

Peter Landgren Wed, 11 Feb 2009 00:24:17 -0800

Peter Landgren <peter.tal...@telia.com> added the comment:

> Martin v. Löwis <mar...@v.loewis.de> added the comment:
> > The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
> > which are also different letters, just as "Ø" and "O" are.
>
> Sure. And rightfully, "Å" is *not* (I repeat: not)
> normalized as "A", under NFD:
>
> py> unicodedata.normalize("NFD", u"Å")
> u'A\u030a'
>
> > Maybe not in the Unicode world, but in real life.
>
> They are different letters also in the Unicode world.
>
> > That's why I'm a little confused.
>
> I think the confusion comes from your assumption that
> normalizing "Å" produces "A". It does not. Really not.


Yes, you are right.

However, the problem shows up when normalize is used in the application to
build an alphabet and to group, for example, all versions of E (É, È, Ë, Ê)
together under E. The first character of the result of normalize is
used to build alphabet labels for surnames:

letter = normalize("NFD", surname)[0].upper()
if letter != last_letter:
    last_letter = letter
....
and this is why I get "A" when the surname begins with "Å".
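
The effect can be reproduced in isolation. The following is only a minimal sketch, assuming Python 3 (the session above is Python 2); the helper name nfd_initial and the sample surnames are made up for illustration and are not taken from GRAMPS:

```python
import unicodedata

def nfd_initial(surname):
    # Same expression as in the fragment above: NFD puts the base letter
    # first and the combining mark (if any) after it.
    return unicodedata.normalize("NFD", surname)[0].upper()

for surname in ("Åberg", "Émile", "Østergaard"):
    print(surname, "->", nfd_initial(surname))
```

"Å" and "É" have canonical decompositions, so only their base letters survive the [0]; "Ø" has none and comes back unchanged.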

This way all variations of E are grouped under "E", but it fails for "Å":
it is shown under the label "A", and not the "A" at the beginning of the
alphabet but one after "Z", where "ÅÄÖ" come.
So the earlier sorting of the surnames works correctly.
(The Swedish alphabet has 29 letters: A, B, C ... X, Y, Z, Å, Ä, Ö.)

Can you think of any solution to this conflict? 
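
One possible direction, given only as a hedged sketch and not as what GRAMPS or unicodedata actually does: keep a per-language set of letters that count as letters of their own, and strip accents only from everything else. The names SWEDISH_OWN_LETTERS and first_letter are hypothetical, and Python 3 is assumed:

```python
import unicodedata

# Hypothetical: letters that are separate alphabet entries in Swedish.
SWEDISH_OWN_LETTERS = frozenset("ÅÄÖ")

def first_letter(surname, own_letters=SWEDISH_OWN_LETTERS):
    """Return the alphabet label for a surname."""
    # Compose first, so that "A" + U+030A is seen as the single letter "Å".
    initial = unicodedata.normalize("NFC", surname)[:1].upper()
    if initial in own_letters:
        return initial
    # Everything else is grouped under its base letter, as before.
    return unicodedata.normalize("NFD", initial)[:1]

for surname in ("Åberg", "Émile", "östen", "Østergaard"):
    print(surname, "->", first_letter(surname))
```

The set would have to change with the language of the user, since other alphabets treat other accented letters as letters of their own.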

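I still think "Å" or "Ä" or "Ö" should behave as "Ø":
>>> unicodedata.normalize("NFD",u"Ø")
u'\xd8'

Now, as you said:
>>> unicodedata.normalize("NFD",u"Å")
u'A\u030a'

But it should be (in my opinion):
>>> unicodedata.normalize("NFD",u"Å")
u'\xc5'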

This is obviously the result of how the Unicode spec is written,
interpreting "Å" as a variation of "A", which it is not.

I have asked the Unicode people, but have not got any answer yet.

The application is GRAMPS: http://gramps-project.org/

Once again, thanks for making some of the Unicode stuff clear!
Regards,
Peter Landgren

Added file: http://bugs.python.org/file13025/unnamed

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue5200>
_______________________________________

