Hi Scott,

> Any insights on what I'm missing would be greatly appreciated.


Traditional Python strings are much more like byte arrays than
character strings.  Explicit unicode strings can be defined as well,
but unicode is a separate data type.

Your isinstance() test is merely checking the data type of the object,
but this has nothing to do with the content stored within.

For instance:

>>> s = u'ä'    # named "s" so the built-in str is not shadowed
>>> if isinstance(s, unicode):
...     print "This is unicode"
... 
This is unicode


And:

>>> s = u'any string, now stored as unicode'
>>> if isinstance(s, unicode):
...     print "This is unicode"
... 
This is unicode


Note the "u" prefix on both string literals.


I suspect that when you include a character with an umlaut directly
in the script as a traditional string, it is stored as whatever bytes
the source file's encoding produces (I guess utf-8), so the string
holds the encoded bytes (once again, just a sequence of bytes) rather
than a single character.

When you read data in from your users and want to inspect it for
characters that fall outside traditional ASCII, I recommend you first
decode it to unicode and then operate on the result.  But for
goodness' sake, don't force it to "ascii"!  If you want to handle
unicode, then interpret the input as utf-8 or whatever makes sense,
then manipulate the resulting unicode object, preserving the extended
character set.
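
As a rough sketch of that decode-then-inspect idea (written in Python 3
syntax; the function name non_ascii_chars and the sample input are made
up for illustration, and utf-8 is only an assumption about what your
users send):

```python
# Python 3 syntax. In Python 2 the same idea uses str.decode() and unicode.
def non_ascii_chars(raw, encoding="utf-8"):
    """Decode raw bytes, then return the characters outside the ASCII range."""
    # bytes -> text; raises UnicodeDecodeError if the bytes are not valid
    # in the assumed encoding.
    text = raw.decode(encoding)
    return [c for c in text if ord(c) > 127]

# Stand-in for bytes read from a user (hypothetical sample).
sample = "K\u00e4se f\u00fcr alle".encode("utf-8")
found = non_ascii_chars(sample)
print(found)
```

The point is that the check runs on decoded characters, so each
accented letter shows up once instead of as a pair of bytes.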

Consider this:

>>> raw = 'ä'
>>> text = raw.decode('utf-8')    # not named "unicode", which would shadow the built-in type
>>> for c in text:
...     print ord(c)
... 
228


Here, since Python knows how to interpret the value stored in the
unicode object, the logical character value is printed out, rather
than the two encoded bytes.
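
To make the contrast concrete, here is a small side-by-side sketch in
Python 3 syntax (where the byte string type is bytes); the variable
names are just for illustration:

```python
# Python 3 syntax: compare the encoded bytes with the decoded character.
raw = "\u00e4".encode("utf-8")        # the two utf-8 bytes for "ä"
text = raw.decode("utf-8")            # one logical character

byte_values = list(raw)               # iterating bytes yields integers
code_points = [ord(c) for c in text]  # iterating text yields characters

print(byte_values)
print(code_points)
```

This should show two byte values (195 and 164) for the encoded form
and the single code point 228 for the decoded one.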


Now, beyond just getting the characters converted into unicode
properly, you still have to worry about what Python considers to
be an uppercase vs. lowercase character.  As far as I know, for
traditional byte strings that depends on the locale you have set in
the environment, while unicode objects follow the Unicode character
tables regardless of locale.  But that's about as far as my knowledge
goes here...
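
For what it's worth, a minimal sketch of case checks on decoded text,
in Python 3 syntax, where (as far as I know) the str methods follow the
Unicode character tables rather than the locale; the sample string is
made up:

```python
# Python 3 syntax: case tests on decoded text.
text = "\u00e4\u00c4b"  # "ä", "Ä", "b"

uppercase = [c for c in text if c.isupper()]
lowercase = [c for c in text if c.islower()]
upper_of_a_umlaut = "\u00e4".upper()

print(uppercase)
print(lowercase)
print(upper_of_a_umlaut == "\u00c4")
```

Run on raw bytes instead, the same checks would see the individual
encoded bytes and would not recognize the accented letters at all.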

Hope that helps,
tim


PS- In Python 3, the default string object *is* unicode.  The old
    behavior of strings is relegated to bytes().  In some ways this
    makes it easier to understand what is going on with unicode.
