Peter Landgren <peter.tal...@telia.com> added the comment:
> Martin v. Löwis <mar...@v.loewis.de> added the comment:
> > The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
> > which are also different letters, just as "Ø" and "O" are.
>
> Sure. And rightfully, "Å" is *not* (I repeat: not)
> normalized as "A" under NFD:
>
> py> unicodedata.normalize("NFD", u"Å")
> u'A\u030a'
>
> > Maybe not in the Unicode world, but in real life.
>
> They are different letters also in the Unicode world.
>
> > That's why I'm a little confused.
>
> I think the confusion comes from your assumption that
> normalizing "Å" produces "A". It does not. Really not.
Yes, you are right.
However, the confusion shows up when normalize is used in the application to
build an alphabet index, grouping for example all versions of E (E, É, È, Ë, Ê)
together under E. The first character of the normalized surname is
used to build the alphabet labels:

    letter = normalize("NFD", surname)[0].upper()
    if letter != last_letter:
        last_letter = letter
    ....

and this is why I get "A" when the surname begins with "Å".
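
To make the labelling behaviour concrete, here is a minimal, self-contained sketch of the same idea. It assumes Python 3; the function name `label_for` and the sample surnames are made up for illustration and are not taken from GRAMPS.

```python
import unicodedata

def label_for(surname):
    # Take the first code point of the NFD form and upper-case it,
    # which is the approach described in the snippet above.
    return unicodedata.normalize("NFD", surname)[0].upper()

surnames = ["Edberg", "\u00c9den", "\u00c5berg"]  # Edberg, Éden, Åberg (made-up data)
labels = [label_for(s) for s in surnames]
print(labels)  # prints ['E', 'E', 'A']
```

The accented "É" folds to "E" as intended, and "Å" folds to "A" by exactly the same mechanism.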
This way all variations of E are correctly grouped under "E", but it
fails for "Å": those surnames are shown under the label "A", and not the "A"
at the beginning of the alphabet but one after "Z", where "Å", "Ä" and "Ö" come.
So the preceding sort of the surnames works correctly; only the label is wrong.
(The Swedish alphabet has 29 letters: A, B, C, ..., X, Y, Z, Å, Ä, Ö.)
Can you think of any solution to this conflict?
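
One possible direction, as an untested sketch rather than a recommendation: keep an explicit set of letters that the user's alphabet treats as distinct, and only fold accents for everything else. The name `DISTINCT_LETTERS`, the function `label_for`, and the choice of hard-coding the Swedish letters are all assumptions made for illustration; a real application would presumably derive the set from the locale. Python 3 is assumed.

```python
import unicodedata

# Assumption: letters that count as separate letters in the user's alphabet.
# For Swedish these are Å, Ä and Ö; other languages would need another set.
DISTINCT_LETTERS = frozenset("\u00c5\u00c4\u00d6")

def label_for(surname, distinct=DISTINCT_LETTERS):
    # Compose first, so that "A" followed by U+030A is seen as the single letter "Å".
    first = unicodedata.normalize("NFC", surname)[0].upper()
    if first in distinct:
        return first
    # Every other letter is folded to its base letter as before.
    return unicodedata.normalize("NFD", first)[0]
```

With this sketch, `label_for("Åberg")` gives "Å" while `label_for("Éden")` still gives "E". The trade-off is that the set of distinct letters is language-specific and has to be maintained somewhere.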
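
I still think "Å" or "Ä" or "Ö" should behave as "Ø":
>>> unicodedata.normalize("NFD",u"Ø")
u'\xd8'

Now, as you said:
>>> unicodedata.normalize("NFD",u"Å")
u'A\u030a'

But it should be (in my opinion):
>>> unicodedata.normalize("NFD",u"Å")
u'\xc5'
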
This is obviously the result of how the Unicode spec is written:
it treats "Å" as a variation of "A", which it is not.
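
The difference between "Å" and "Ø" can be inspected directly in the character data that Python ships. The following is a small sketch assuming Python 3; the helper name `canonical_parts` is made up for illustration.

```python
import unicodedata

def canonical_parts(ch):
    # Raw decomposition mapping from the Unicode Character Database;
    # an empty string means the character has no decomposition.
    return unicodedata.decomposition(ch)

for ch in ("\u00c5", "\u00c9", "\u00d8"):  # Å, É, Ø
    print(ascii(ch), repr(canonical_parts(ch)))
```

"Å" and "É" each carry a mapping to a base letter plus a combining mark, whereas "Ø" has no mapping at all, which is why NFD leaves it unchanged.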
I have asked the Unicode people, but have not received an answer yet.
The application is GRAMPS: http://gramps-project.org/
Once again, thanks for making some of the Unicode stuff clear!
Regards,
Peter Landgren
Added file: http://bugs.python.org/file13025/unnamed
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue5200>
_______________________________________