Jeremy Reichman wrote:
I have some characters in line strings in a file I'm processing that appear
to be Unicode. (When I print them to the shell from my script, they are
Asian characters for files like fonts in the Mac OS X filesystem.)
When I run a.split() on the affected line strings, they split on what I'm
guessing is considered a Unicode whitespace character. Specifically, the
culprit seems to be '\xe1':
$ python -c 'print "\xe1"'
?
Actually, u'\xe1' is a lowercase accented a: á (if the Unicode comes
through email OK), so I doubt that Python is splitting on that.
Also, when you do the above, you're creating a regular string, not a
unicode object. If you do:
$ python -c 'print u"\xe1"'
á
You may get the right thing, if your terminal is set up right to
display Unicode.
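
Here is a small sketch of both points. It is written in Python 3 syntax
(an assumption on my part, since the commands above are Python 2), where
bytes and text are separate types. The variable names are only for
illustration.

```python
import unicodedata

raw = b"\xe1"    # one byte; it has no meaning until an encoding is chosen
text = "\xe1"    # the Unicode code point U+00E1

# Look up what U+00E1 actually is.
name = unicodedata.name(text)
print(name)

# The code point is not whitespace, so split() does not break on it.
print(text.isspace())
print("caf\xe1 bar".split())

# The byte 0xE1 maps to U+00E1 only if the data is Latin-1 encoded.
print(raw == text.encode("latin-1"))
```

If the file is in some other encoding, the same byte means something else
entirely, which is why the decoding step matters.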
I suspect your problem is that you aren't decoding the input file
correctly. The whole problem with Unicode (and indeed, any non-ASCII
encoding) is that you need to know what encoding your data is in, in
order to use it. If it looks mostly OK when interpreted as ASCII, then it
MIGHT be utf8, so try reading in your file and decoding it this way:
contents = myfile.read().decode('utf8')
Then do your splitting. If it's not utf8, then you'll need to figure out
what it is.
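
A minimal sketch of "decode first, then split", again in Python 3 syntax
(an assumption; the one-liner above is Python 2). The helper name
split_utf8_line and the sample file name are hypothetical, made up for
this example. Note that a clean decode only tells you the bytes are valid
utf8, not that utf8 is certainly what the file was written in.

```python
def split_utf8_line(raw_line):
    """Decode one line of bytes as UTF-8, then split it on whitespace."""
    try:
        text = raw_line.decode("utf-8")
    except UnicodeDecodeError as err:
        # Not valid UTF-8: stop here instead of splitting undecoded bytes.
        raise ValueError("line is not UTF-8, find out its real encoding: %s" % err)
    return text.split()

# Hypothetical sample: a font file name starting with Japanese katakana,
# written with \u escapes so the source stays ASCII.
sample = "\u30d2\u30e9\u30ae\u30ce Pro W3.otf".encode("utf-8")
fields = split_utf8_line(sample)
print(fields)
```

Splitting the decoded text works on whole characters, so a multi-byte
character can never be cut in the middle.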
First, read this:
http://www.joelonsoftware.com/articles/Unicode.html
Then take a look at some of the Python Unicode tutorials. This is only
one of them:
http://www.reportlab.com/i18n/python_unicode_tutorial.html
There are other good ones.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
_______________________________________________
Pythonmac-SIG maillist - Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig