[portland] Umlats from another dimension

Scott Garman Sun, 12 Oct 2014 15:10:53 -0700

Hi all,

I'm getting pretty confused by a problem I'm trying to solve in Python, which is to detect lower-case characters in a string. This would normally be a simple regex, but I also have to accept input strings with umlauts in them, such as 'ä'. I'm using Python 2.7.6.

At first I thought this was a Unicode problem, but now I'm not so sure. About anything.


#!/usr/bin/env python
# -*- coding: utf-8 -*-

str = 'ä'

if isinstance(str, unicode):
        print "This is unicode"

Running this tells me that the string is *not* unicode. I know that there's a thing called extended ASCII, and if I look up a table for that, I see characters with accents and umlauts:


http://www.asciitable.com/

This table suggests that 'ä' should correspond to an ordinal value of 132. But if I run:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

string = 'ä'

for c in string:
    print ord(c)

I get:

195
164

which tells me that I'm dealing with a two-byte character, which brings me back to this being Unicode.

Now looking at which characters in the extended ASCII table correspond to those values, I don't see any relation to 'ä'.
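
One way to reconcile the two sets of numbers is sketched here. It assumes the source file is saved as UTF-8 (as the coding line declares) and that the "extended ASCII" table is IBM code page 437, which is a guess based on the value 132. Under those assumptions, 195 and 164 are the two UTF-8 bytes of a single character, and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The two values printed by the loop (195, 164), written as raw bytes.
raw = b'\xc3\xa4'

# Interpreting those bytes as UTF-8 yields a single character.
text = raw.decode('utf-8')
code_point = ord(text)

# Assumption: the "extended ASCII" table is code page 437. Encoding the
# same character with that codec gives one byte.
cp437_value = bytearray(text.encode('cp437'))[0]

print(len(raw), len(text))  # 2 1
print(code_point)           # 228
print(cp437_value)          # 132
```

Looking up 195 and 164 individually in a single-byte table treats each UTF-8 byte as its own character, which would explain why neither one resembles 'ä'.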

Finally, my understanding of Python 2.x is that it does not support Unicode in regexes. Otherwise I'd just use \p{Ll} and have a good deal more hair left on my head.
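
As far as I know, the standard `re` module has no `\p{Ll}` property class, so a possible alternative is to check each character's Unicode category with the standard `unicodedata` module. This is only a sketch: `has_lowercase` is a hypothetical helper name, and the input is assumed to be UTF-8 bytes that get decoded to a Unicode string first.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def has_lowercase(text):
    """Return True if any character of a Unicode string is in category Ll."""
    return any(unicodedata.category(ch) == 'Ll' for ch in text)


# Byte strings are decoded first; UTF-8 is an assumption about the input.
sample = b'AB\xc3\xa4'.decode('utf-8')   # 'AB' followed by 'ä'

print(has_lowercase(sample))          # True
print(has_lowercase(u'ABC \xc4'))     # False: 'Ä' is upper case (Lu)
```

The check only makes sense on decoded text; on a raw byte string each byte would be examined separately.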


I've also tried forcing the string to ASCII using:

str.decode("ascii", "ignore")

and this is one of those characters that just gets dropped in the conversion.
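
A small sketch of why the character disappears, assuming the bytes in question are the UTF-8 encoding of 'ä': both bytes are above 127, so the ASCII codec with "ignore" discards them, while decoding as UTF-8 keeps the character.

```python
from __future__ import print_function

raw = b'\xc3\xa4'   # assumed: the UTF-8 bytes for 'ä'

# Neither byte is valid ASCII, so "ignore" silently drops both.
as_ascii = raw.decode('ascii', 'ignore')

# Decoding with the encoding the bytes were written in keeps the character.
as_utf8 = raw.decode('utf-8')

print(len(as_ascii))       # 0
print(len(as_utf8))        # 1
print(as_utf8.islower())   # True
```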


Any insights on what I'm missing would be greatly appreciated.

Thanks,

Scott

_______________________________________________
Portland mailing list
[email protected]
https://mail.python.org/mailman/listinfo/portland
