Re: [Chandler-dev] Unicode and letting u be u (umlaut)

Brian Kirsch Wed, 12 Apr 2006 16:46:41 -0700

Yes, Python Unicode is fun to work with, isn't it :)

Grant raises a good point. The \u syntax is the best way to ensure that what you intended to render actually is correct.

Python provides a means to specify the source character set encoding at the top of a Python file. If one does that, then text within that file will be converted to Unicode from that character set.


For example:

# -*- coding: utf-8 -*-

exampleText = u"This is some Unicode with non-ascii character: ü"
print repr(exampleText)
# prints: u'This is some Unicode with non-ascii character: \xfc'

However, in the command line interpreter Python does not know what the source encoding is, and it must be explicitly defined.

>>> exampleText = unicode("This is some Unicode with non-ascii character: ü", "utf8")

>>> exampleText
u'This is some Unicode with non-ascii character: \xfc'
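On Python 3, where the unicode() built-in no longer exists, a rough equivalent of the explicit decode above might look like the sketch below. The variable names are illustrative only, and the bytes literal stands in for what a UTF-8 terminal would hand to the interpreter.

```python
# Python 3 sketch (an assumption; the examples in this thread are Python 2).
# These are the raw bytes a UTF-8 terminal would send for the example text.
raw_bytes = b"This is some Unicode with non-ascii character: \xc3\xbc"

# Decode explicitly, naming the encoding, rather than letting Python guess.
example_text = raw_bytes.decode("utf-8")

# ascii() escapes non-ASCII characters so the code point is visible.
print(ascii(example_text))
```

The two bytes \xc3\xbc collapse into the single code point U+00FC, so the decoded text is one character shorter than the byte string.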

My example in the i18n Busy Developer Guide should have been the above and not:

>>> exampleInstance.exampleText = u"This is some Unicode with non-ascii character: ü"

>>> exampleInstance.exampleText
u"This is some Unicode with non-ascii character: \xc3\xbc"

I have updated the guide to correct the error.

Using the \u syntax is a better choice because no encoding needs to be explicitly specified in the file or the terminal.


For example:

exampleText = u"This is some Unicode with non-ascii character: \u00FC"

Of course, if your terminal uses the ASCII character set then it will not render correctly :)
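A small Python 3 sketch of both points, offered as an illustration rather than anything from the Chandler code base: the source text of a \u literal is pure ASCII, so no file or terminal encoding is involved, yet an ASCII-only terminal still cannot display the resulting character. The names source_literal, value, and shown are hypothetical.

```python
import ast

# The literal as it would appear in a source file: backslash, 'u', four hex digits.
source_literal = r'"non-ascii character: \u00FC"'

# The source text itself is plain ASCII; this raises UnicodeEncodeError otherwise.
source_literal.encode("ascii")

# Evaluating the literal yields the real character U+00FC.
value = ast.literal_eval(source_literal)

# An ASCII terminal cannot show U+00FC, so fall back to an escaped form.
shown = value.encode("ascii", "backslashreplace").decode("ascii")
print(shown)
```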



--Brian








Brian Kirsch -  Cosmo Developer / Chandler Internationalization Engineer
Open Source Applications Foundation
543 Howard St. 5th Floor
San Francisco, CA 94105
http://www.osafoundation.org



Grant Baillie wrote:

I've run across a couple of cases of specifying unicode characters in Python code that were a little fishy, so I thought I'd send out a long, rambly email to the list.
The 10-second summary is: If you want to specify a non-ASCII character in a unicode string, the python \uxxxx escape is your friend. With anything else, you're playing with fire.
So, to cut a short story long, I was looking at a test case in Chandler, where we were trying to come up with a non-ASCII path to use in a Chandler profile directory:
TestCrypto.py:13:        u = u"profileDir_(\xc3\xbc)" # u umlaut
This actually succeeds in setting u to be a non-ASCII string, except that it doesn't contain a "u umlaut". When you specify a u"..." style string in Python, you're telling the interpreter to assume each character in the string is a unicode code point. Looking at the list in
<http://www.unicode.org/Public/UNIDATA/NamesList.txt>

you can determine that "u umlaut" is the Unicode character(*)

    00FC    LATIN SMALL LETTER U WITH DIAERESIS

but in the above, the \xc3 and \xbc are interpreted as:

   00C3    LATIN CAPITAL LETTER A WITH TILDE
   00BC    VULGAR FRACTION ONE QUARTER

Clearly, we don't want any vulgarity in our paths, now do we :) ?
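The lookup above can be reproduced with the standard unicodedata module. This is a Python 3 sketch (the test case itself is Python 2), and the helper names_of_non_ascii is a hypothetical name introduced only for illustration.

```python
import unicodedata

wrong = "profileDir_(\xc3\xbc)"  # the string from the test case
right = "profileDir_(\u00fc)"    # the string that was intended


def names_of_non_ascii(text):
    """Return the Unicode names of every non-ASCII character in text."""
    return [unicodedata.name(ch) for ch in text if ord(ch) > 127]


print(names_of_non_ascii(wrong))
print(names_of_non_ascii(right))
```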
It turns out that the author of the above code was having trouble entering u umlaut (in a console, or a code editor). As mentioned above, the easiest and most portable way to do this kind of thing is to use the \u escape, viz:
     u = u"profileDir_(\u00fc)" # u umlaut
In the case of source files, Python has some handy conventions for specifying what the character encoding of a source file is (see <http://docs.python.org/ref/encodings.html#encodings>). Unfortunately, it turns out that there's no convention that's adopted by many editors. Possibly this is a reason to require everyone to use emacs, or vim, but the resulting religious war would take us well past Chandler 1.0 :).
In the case of entering text in an interactive session, you're somewhat at the mercy of your terminal program, as well as your locale. To continue the story, the characters \xc3\xbc above (which are the UTF-8 encoding of \u00fc), did not come from nowhere. The developer mentioned earlier copy-and-pasted them from the following bit of text in the I18n Busy Developers Guide:
>>> exampleInstance.exampleText = u"This is some unicode with non-ascii character: ü"
>>> exampleInstance.exampleText
u"This is some unicode with non-ascii character: \xc3\xbc"
As we determined above, the printed-out value does not end with ü. In fact, what happened above was the terminal program was using UTF-8, but Python had no idea that that was the case, and converted the raw UTF-8 bytes to unicode characters.
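The effect described here, one code point per raw byte, can be imitated in Python 3 by decoding the UTF-8 bytes as Latin-1. This is a sketch under that assumption, not a trace of what the Python 2 interpreter did internally, and the variable names are illustrative.

```python
intended = "\u00fc"  # u umlaut

# What a UTF-8 terminal actually sends for that one character.
utf8_bytes = intended.encode("utf-8")

# Treating each byte as its own code point is what a Latin-1 decode does.
misread = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode("latin-1").decode("utf-8")

print(ascii(utf8_bytes), ascii(misread), ascii(repaired))
```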
--Grant

(*) It's also representable as the sequence of two characters

   0075    LATIN SMALL LETTER U
   0308    COMBINING DIAERESIS (Dialytika)
           = double dot above, umlaut
           = Greek dialytika
           = double derivative
           x (diaeresis - 00A8)

but that's a whole different can of fish, er crosstown bus.
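For the curious, the two representations in the footnote compare unequal as raw strings but can be reconciled with Unicode normalization. A minimal Python 3 sketch using the standard unicodedata module follows; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS

# The same visible character, but different code point sequences.
print(precomposed == decomposed)

# NFC composes the pair into one code point; NFD splits it back into two.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed, split == decomposed)
```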





_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev

