Re: [Pythonmac-SIG] Unicode and split

Christopher Barker Fri, 23 May 2008 09:14:55 -0700

Jeremy Reichman wrote:

I have some characters in line strings in a file I'm processing that appear
to be Unicode. (When I print them to the shell from my script, they are
Asian characters for files like fonts in the Mac OS X filesystem.)


When I run a.split() on the affected line strings, they split on what I'm
guessing is considered a Unicode whitespace character. Specifically, the
culprit seems to be '\xe1':

$ python -c 'print "\xe1"'
?

Actually, u'\xe1' is a lowercase accented a: á (if the Unicode comes through email OK), so I doubt that Python is splitting on that.
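One way to check this for yourself, as a rough sketch: ask the standard unicodedata module what the character is, and ask the string whether it counts as whitespace. The sample string "caf\xe1 bar" is made up for illustration. The u"" prefix is accepted by Python 2 and by Python 3.3 or later.

```python
import unicodedata

ch = u"\xe1"

# Official Unicode name of the character U+00E1
name = unicodedata.name(ch)

# split() with no arguments only separates on characters where isspace() is true
is_space = ch.isspace()

# Made-up sample: only the real space acts as a separator
parts = u"caf\xe1 bar".split()

print(name)
print(is_space)
print(len(parts))
```

If is_space comes back False, then split() on a unicode object is not treating that character as a separator, and the splitting you see is coming from somewhere else.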

Also, when you do the above, you're creating a regular string, not a unicode object. If you do:


$ python -c 'print u"\xe1"'
á

You may get the right thing, if your terminal is set up right to display Unicode.
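To make the difference between the two concrete, here is a small sketch. It uses the b"" prefix for the byte string so that it behaves the same on Python 2.6+ and Python 3; in Python 2 a plain "\xe1" is the same thing. The variable names are just for illustration.

```python
raw = b"\xe1"     # one byte, with no character meaning attached yet
text = u"\xe1"    # the Unicode character U+00E1

# Encoding the character as UTF-8 produces two bytes, not the single byte 0xE1
utf8_bytes = text.encode("utf-8")

# The single byte 0xE1 only becomes U+00E1 if you assume Latin-1
latin1_text = raw.decode("latin-1")

# On its own, the byte 0xE1 is not valid UTF-8
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(len(raw), len(utf8_bytes), latin1_text == text, valid_utf8)
```

So the byte 0xE1 in your data is probably one piece of a longer multi-byte sequence, or the data is in some single-byte encoding. Which one depends on the file.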

I suspect your problem is that you aren't decoding the input file correctly. The whole problem with Unicode (and indeed, any non-ASCII encoding) is that you need to know what encoding your data is in, in order to use it. If it looks mostly OK when interpreted as ASCII, then it MIGHT be utf8, so try reading in your file and decoding it this way:


contents = myfile.read().decode('utf8')

Then do your splitting. If it's not utf8, then you'll need to figure out what it is.
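Put together, the decode-then-split approach might look something like the sketch below. The helper name read_words, the temporary file, and the sample file names written into it are all made up for the demo, and utf8 is only a guess at your real encoding. It reads the file in binary mode so the decode step is explicit, which works on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def read_words(path, encoding="utf8"):
    """Read a file as bytes, decode it, and split each line on whitespace."""
    with io.open(path, "rb") as f:
        contents = f.read().decode(encoding)
    return [line.split() for line in contents.splitlines()]


# Demo with made-up sample data written to a temporary file
fd, sample_path = tempfile.mkstemp()
os.close(fd)
with io.open(sample_path, "wb") as f:
    f.write(u"Hiragino \u89d2\u30b4.otf\nOs\xe1ka.dfont\n".encode("utf8"))

words = read_words(sample_path)
os.remove(sample_path)

print(len(words), len(words[0]), len(words[1]))
```

If the decode call raises UnicodeDecodeError on your real file, that is a strong hint the data is not utf8 and you need to find out what it actually is.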


First, read this:
http://www.joelonsoftware.com/articles/Unicode.html

Then take a look at some of the Python Unicode tutorials; this is only one of them:


http://www.reportlab.com/i18n/python_unicode_tutorial.html

There are other good ones.

-Chris



--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[EMAIL PROTECTED]

_______________________________________________
Pythonmac-SIG maillist  -  Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig
