I'm working with Python 2.5.4 and the NLTK (Natural Language Toolkit). I'm an experienced programmer, but new to Python.
This question arose when I tried to create a literal in my source code for a Unicode code point greater than 255. (I also posted this question in the NLTK discussion group.)

The Python help (at least for version 2.5.4) states:

+++++++
Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-
++++++++++++

Based on some experimenting I've done, I suspect that the claim of support for Unicode literals in ANY encoding isn't really accurate. What seems to happen is that only characters that map onto an 8-bit encoding can actually be used in literals. Even when I set Options / General / Default Source Encoding to UTF-8, IDLE won't allow Unicode characters (e.g. characters copied and pasted from the Windows Character Map program) to be used, even in a quoted string, if their ord value is greater than 255.

I noticed, in researching this question, that Marc Andre Lemburg stated, back in 2001, "Since Python source code is defined to be ASCII..."

I'm writing code for linguistics (for languages other than English), so I need access to many more characters. Most of the time the characters come from files, so that is no problem. But for some processing tasks, I simply must be able to use "real" Unicode literals in the source code. (Writing hex escape sequences in a complex regex would be a nightmare.)

Was this taken care of in the switch from Python 2.X to 3.X? Is there a way to use Unicode characters above 255 in source code literals in Python 2.5.4?

Also, in the Windows version of Python, how can I tell whether it was compiled to support 16 bits of Unicode or 32 bits of Unicode?

Bruce in Toronto
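
P.S. To make the question concrete, here is a minimal sketch of what I would like to be able to write. It assumes the source file is saved as UTF-8 and carries a matching coding declaration; the Greek sample text and the names greek, escaped, pattern and words are just illustrations, not taken from my real code.

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8, matching the declaration above.
import re

# A Unicode literal whose characters all have ord values above 255.
greek = u"αβγ"

# The same string written with \u escapes, which is what I want to avoid.
escaped = u"\u03b1\u03b2\u03b3"

# A regex character class written with literal characters, not escapes.
pattern = re.compile(u"[α-ω]+", re.UNICODE)

# Pick the Greek word out of a mixed string.
words = pattern.findall(u"abc λογος xyz")
```

If literals like these are accepted, greek and escaped should compare equal, and the regex stays readable instead of turning into a wall of hex escapes.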