Uncle Bruce wrote:
> I'm working with Python 2.5.4 and the NLTK (Natural Language
> Toolkit).  I'm an experienced programmer, but new to Python.
>
> This question arose when I tried to create a literal in my source code
> for a Unicode codepoint greater than 255.  (I also posted this
> question in the NLTK discussion group).
>
> The Python HELP (at least for version 2.5.4) states:
>
> +++++++
> Python supports writing Unicode literals in any encoding, but you have
> to declare the encoding being used. This is done by including a
> special comment as either the first or second line of the source file:
>
> #!/usr/bin/env python
> # -*- coding: latin-1 -*-
> ++++++++++++
>
> Based on some experimenting I've done, I suspect that the support for
> Unicode literals in ANY encoding isn't really accurate.  What seems to
> happen is that there must be an 8-bit mapping between the set of
> Unicode literals and what can be used as literals.
>
> Even when I set Options / General / Default Source Encoding to UTF-8,
> IDLE won't allow Unicode literals (e.g. characters copied and pasted
> from the Windows Character Map program) to be used, even in a quoted
> string, if they represent an ord value greater than 255.
>
> I noticed, in researching this question, that Marc Andre Lemburg
> stated, back in 2001, "Since Python source code is defined to be
> ASCII..."
>
> I'm writing code for linguistics (other than English), so I need
> access to lots more characters.  Most of the time, the characters come
> from files, so no problem.  But for some processing tasks, I simply
> must be able to use "real" Unicode literals in the source code.
> (Writing hex escape sequences in a complex regex would be a
> nightmare).
>
> Was this taken care of in the switch from Python 2.X to 3.X?
>
> Is there a way to use more than 255 Unicode characters in source code
> literals in Python 2.5.4?
>
> Also, in the Windows version of Python, how can I tell if it was
> compiled to support 16 bits of Unicode or 32 bits of Unicode?
>
> Bruce in Toronto

Works for me:

--- snip ---
$ cat snowman.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

snowman = u'☃'

print len(snowman)
print unicodedata.name(snowman)
$ python2.6 snowman.py
1
SNOWMAN
--- snip ---

What did you set the encoding to in the declaration at the top of the
file?  The help text you quoted uses latin-1 as an example, an encoding
which, of course, only supports 256 code points.  Did you try utf-8
instead?

The regular expression engine's Unicode support is a different question,
and I do not know the answer.

By the way, Python 2.x only supports using non-ASCII characters in
source code in string literals.  Python 3 adds support for Unicode
identifiers (e.g. variable names, function argument names, etc.).

-- 
http://mail.python.org/mailman/listinfo/python-list
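
A small sketch to go with the points above, not a definitive recipe. It checks whether the snowman character can be represented in latin-1 versus utf-8 (the reason the coding declaration matters), and it reports whether the interpreter is a narrow (16-bit) or wide (32-bit) Unicode build by looking at sys.maxunicode, which is the usual way to answer the last question in the original post. The helper names can_encode and build_width are made up for illustration. The character is written as an escape so the file stays pure ASCII and needs no coding declaration.

```python
import sys
import unicodedata

# U+2603 SNOWMAN, written as an escape so this file itself is plain ASCII.
SNOWMAN = u'\u2603'


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def build_width():
    """Report whether this interpreter is a narrow or a wide Unicode build."""
    # sys.maxunicode is 0xFFFF on narrow (16-bit) builds, 0x10FFFF otherwise.
    if sys.maxunicode == 0xFFFF:
        return 'narrow'
    return 'wide'


if __name__ == '__main__':
    print(unicodedata.name(SNOWMAN))
    print(can_encode(SNOWMAN, 'latin-1'))  # latin-1 stops at code point 255
    print(can_encode(SNOWMAN, 'utf-8'))    # utf-8 covers all of Unicode
    print(build_width())
```

On Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the narrow/wide distinction only matters for Python 2 and early Python 3 builds.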