On 10/12/2014 03:33 PM, Sam Thompson wrote:
The first important concept to understand is that UTF-8 and Unicode are not
the same thing.

Because you specified coding: utf-8, every plain string literal in your
Python script is a bytestring encoded as utf-8.  That is not the same as a
Python unicode object; it is a bytestring because you are using Python 2.x,
where only u'' literals create unicode objects.
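
To make the distinction concrete, here is a minimal sketch.  It is written
so that it behaves the same on Python 2.7 and 3.x, which is my assumption
about what is convenient, not something from your script: b'' marks the
bytestring explicitly (what a plain '' literal gives you in 2.x) and u''
marks the unicode object.  The names raw and text are only illustrative.

```python
# -*- coding: utf-8 -*-
# b'' = bytestring (a plain '' literal in Python 2), u'' = unicode object.
raw = b'\xc3\xa4'    # the two utf-8 bytes that a plain 'ä' literal would hold
text = u'\xe4'       # the unicode object for 'ä', a single code point

same_type = type(raw) is type(text)   # the two are different types
byte_len = len(raw)                   # counts bytes
char_len = len(text)                  # counts code points

print(same_type, byte_len, char_len)
```

The lengths differ because len() counts bytes for the bytestring and
code points for the unicode object.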

The reason that 'ä' produces two character ordinals is that utf-8 is a
variable-length encoding.  195 and 164 (0xC3 0xA4) are the two bytes that
encode 'ä' in utf-8; its code point is U+00E4 (228).  If you want the
python unicode object for the string, use mystring.decode('utf-8')
instead of 'ascii', because it's not ascii.
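
A short sketch of where the two ordinals come from, and of what decoding
does with them.  The variable names are made up for illustration, and
bytearray is used only so the byte values print the same way on Python 2
and 3.

```python
raw = b'\xc3\xa4'                     # utf-8 bytes for 'ä'

# Looking at the bytestring byte by byte gives the two ordinals.
byte_values = list(bytearray(raw))    # [195, 164]

# Decoding with the right codec yields one unicode character.
text = raw.decode('utf-8')
code_point = ord(text)                # 228, i.e. U+00E4

# Decoding as ascii fails, because neither byte is in the ascii range.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```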

The second important concept is that strings defined within the python
script may not be the same type as strings read from input, a file, a web
request, etc.  Where is your input coming from?

If you can be sure your input is utf-8 (and this is a giant leap if you're
working with web input), convert it to unicode (via .decode()), iterate
over the unicode sequence and test each character with .islower().
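
Something along these lines, as a rough sketch that assumes the input
really is utf-8.  The function name lowercase_flags and its encoding
parameter are my own invention, not anything from your code.

```python
def lowercase_flags(raw, encoding='utf-8'):
    """Decode a bytestring and report, per character, whether it is lowercase.

    Raises UnicodeDecodeError if raw is not valid in the given encoding.
    """
    text = raw.decode(encoding)               # bytestring -> unicode
    return [(ch, ch.islower()) for ch in text]

# Sample input: the utf-8 bytes for 'a', 'Ä' and 'ä'.
sample = u'a\xc4\xe4'.encode('utf-8')
flags = [is_lower for _, is_lower in lowercase_flags(sample)]
```

Here flags comes out as [True, False, True]: one answer per character,
not one per byte.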

If you can't be sure what encoding your bytestrings are in, check out the
chardet library on PyPI, which guesses the encoding from the bytes.

Thanks Sam, this explanation helped to fill in my gaps on bytestrings and unicode in python, which until now I've been quite clueless about.

Scott

_______________________________________________
Portland mailing list
Portland@python.org
https://mail.python.org/mailman/listinfo/portland
