Thanks Eric! My simplified test case was throwing me off. Now with that
fixed, I see that islower() seems to work with the umlaut characters,
which is how I can address the main problem I'm working on.
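
For anyone finding this thread later, here is a minimal sketch of that check. The helper name has_lowercase is made up for illustration, and it assumes the input has already been turned into a unicode string; on a Python 2 byte string, islower() would not see 'ä' as a single letter.

```python
# -*- coding: utf-8 -*-
# Sketch only: has_lowercase is a hypothetical helper, not from the thread.

def has_lowercase(text):
    """Return True if any character of a unicode string is lower-case."""
    return any(ch.islower() for ch in text)

# u'\xe4' is 'ä' and u'\xc4' is 'Ä', written as escapes so the file
# encoding does not matter.
print(has_lowercase(u'\xe4'))
print(has_lowercase(u'\xc4B1'))
```

This should behave the same on Python 2.7 and Python 3, since the u'' prefix is accepted by both.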
Regards,
Scott
On 10/12/2014 03:24 PM, Eric Holscher wrote:
I think you want:
string = u'ä'
This will define it as a unicode string.
For example:
In [7]: string = unicode('ä')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-7-5cba4c2df988> in <module>()
----> 1 string = unicode('ä')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
In [8]: string = unicode(u'ä')
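
If the text arrives as bytes (from a file or the terminal, say) rather than as a literal, the u'' prefix is not available, and the usual approach is to decode with an explicit encoding. A rough sketch, assuming the input really is UTF-8:

```python
# The two UTF-8 bytes of 'ä', which is what a Python 2 str holds when the
# source file is saved as UTF-8.
raw = b'\xc3\xa4'

# Decoding as ASCII fails, which is the error shown in the traceback above.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print(exc)

# Naming the real encoding gives a one-character unicode string.
text = raw.decode('utf-8')
print(len(raw))
print(len(text))
```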
Cheers,
Eric
On Sun, Oct 12, 2014 at 3:01 PM, Scott Garman <[email protected]> wrote:
Hi all,
I'm getting pretty confused by a problem I'm trying to solve in python,
which is to detect lower-case characters in a string. This would normally
be a simple regex, but I have to also accept input strings with umlauts in
them, such as 'ä'. I'm using python 2.7.6.
At first I thought this was a unicode problem, but now I'm not so sure.
About anything.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
str = 'ä'
if isinstance(str, unicode):
    print "This is unicode"
Running this tells me that string is *not* unicode. I know that there's a
thing called extended ASCII, and if I look up a table for that, I see
characters with accents and umlauts:
http://www.asciitable.com/
This table suggests that 'ä' should correspond to an ordinal value of 132.
But if I run:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
string = 'ä'
for c in string:
    print ord(c)
I get:
195
164
which tells me that I'm dealing with a two-byte character, which brings me
back to this being unicode.
Now looking at which characters in the extended ASCII table correspond to
those values, I don't see any relation to 'ä'.
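
The numbers above can be reconciled by encoding the same character two ways. This sketch assumes the "extended ASCII" table in question is the old IBM PC code page, which Python calls cp437; 195 and 164 are the UTF-8 encoding of the same character.

```python
# 'ä' as a unicode string, written as an escape so the file encoding
# does not matter.
text = u'\xe4'

# bytearray gives integer byte values on both Python 2.7 and Python 3.
utf8_bytes = list(bytearray(text.encode('utf-8')))
cp437_bytes = list(bytearray(text.encode('cp437')))

print(utf8_bytes)   # the two values printed by the loop over the byte string
print(cp437_bytes)  # the single value from the extended ASCII table
print(ord(text))    # the Unicode code point, which is neither of the above
```

So the byte values depend entirely on the encoding, and looking up UTF-8 bytes in a cp437 table will not lead back to 'ä'.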
Finally, my understanding of python 2.x is that it does not support
unicode in regexes. Otherwise I'd just use \p{Ll} and have a good deal more
hair left on my head.
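
For what it's worth, the standard library re module does accept unicode strings and has a re.UNICODE flag, but it does not understand \p{Ll}. One way to get the same category test without a regex is unicodedata.category(). A minimal sketch, where lowercase_letters is a made-up helper name and the input is assumed to be a unicode string:

```python
import unicodedata


def lowercase_letters(text):
    """Return the characters whose Unicode general category is Ll."""
    return [ch for ch in text if unicodedata.category(ch) == 'Ll']


# u'\xe4' is 'ä' and u'\xc4' is 'Ä'.
sample = u'a\xe4B\xc41'
found = lowercase_letters(sample)
print(len(found))
```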
I've also tried forcing the string to ASCII using:
str.decode("ascii", "ignore")
and this is one of those characters that just gets dropped in the
conversion.
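
A small sketch of why the character disappears, assuming the bytes are UTF-8: neither byte of 'ä' is valid ASCII, so the "ignore" handler silently discards both.

```python
# 'café ä' as UTF-8 bytes; the accented characters take two bytes each.
raw = b'caf\xc3\xa9 \xc3\xa4'

# With "ignore", every non-ASCII byte is dropped without an error.
dropped = raw.decode('ascii', 'ignore')

# Decoding with the actual encoding keeps every character.
kept = raw.decode('utf-8')

print(len(dropped))
print(len(kept))
```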
Any insights on what I'm missing would be greatly appreciated.
Thanks,
Scott
_______________________________________________
Portland mailing list
[email protected]
https://mail.python.org/mailman/listinfo/portland