Hi Scott,

> Any insights on what I'm missing would be greatly appreciated.

Traditional Python strings are much more like byte arrays than character strings. Explicit unicode strings can be defined as well, but they are a separate data type. Your isinstance() test merely checks the data type of the object; it says nothing about the content stored within. For instance:

    >>> s = u'ä'
    >>> if isinstance(s, unicode):
    ...     print "This is unicode"
    ...
    This is unicode

And:

    >>> s = u'any string, now stored as unicode'
    >>> if isinstance(s, unicode):
    ...     print "This is unicode"
    ...
    This is unicode

Note the "u" prefix on the string definitions.

I suspect that when you include a character with an umlaut statically in the script as a traditional string, it is stored as bytes in whatever encoding the script was saved in (I guess utf-8). Once again, the string is just a sequence of bytes.

When you read data in from your users and want to inspect it for characters that fall outside traditional ascii, I recommend you first decode it to unicode and then perform your operations on the result. But for goodness' sake, don't force it to "ascii"! If you want to handle unicode, interpret the input as utf-8 or whatever makes sense, then manipulate the resulting unicode object, preserving the extended character set. Consider this:

    >>> raw = 'ä'
    >>> text = raw.decode('utf-8')
    >>> for c in text:
    ...     print ord(c)
    ...
    228

Here, since Python knows how to interpret the value stored in the unicode object, the logical character value is printed, rather than the two encoded bytes.

Now, beyond just getting the characters converted into unicode properly, you still have to worry about what Python considers to be an uppercase vs. lowercase character. I believe that for plain byte strings it depends on the locale set in the environment, whereas unicode objects follow Unicode's own case rules. But that's about as far as my knowledge goes here...

Hope that helps,
tim

PS- In Python 3, the default string object *is* unicode. The old behavior of strings is relegated to bytes.
In some ways this makes it easier to understand what is going on with unicode.
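
For comparison, here is a rough Python 3 sketch of the same decode-then-inspect idea. It is only an illustration, not a drop-in fix for your script: the sample bytes are assumed to be utf-8, and non_ascii_chars() is a hypothetical helper name I made up for this example.

```python
# Python 3 sketch. Assumption: the incoming bytes are utf-8 encoded.
raw = b'\xc3\xa4'            # the two utf-8 bytes for a-umlaut
text = raw.decode('utf-8')   # decode once, at the boundary

print(isinstance(raw, bytes))    # True
print(isinstance(text, str))     # True: str is the unicode type in Python 3
print(len(raw), len(text))       # 2 1 (two bytes, one character)
print([ord(c) for c in text])    # [228]


def non_ascii_chars(s):
    """Hypothetical helper: return the characters of s outside 7-bit ascii."""
    return [c for c in s if ord(c) > 127]


# Code points are printed instead of the characters themselves so the
# output does not depend on the terminal's encoding.
print([ord(c) for c in non_ascii_chars('Portl\u00e4nd')])   # [228]

# Case mapping on a Python 3 str follows the Unicode rules.
print(text.upper() == '\u00c4')  # True (capital A-umlaut)
```

The point is that bytes and str never get mixed up silently: you decode once when the data comes in, then work with real characters from there on.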