The documentation for Py_UNICODE states the following: "This type represents a 16-bit unsigned storage type which is used by Python internally as basis for holding Unicode ordinals. On platforms where wchar_t is available and also has 16-bits, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short."
However, we have found this not to be true on at least certain Red Hat versions (maybe all, but I'm not willing to say that at this point). On these systems pyconfig.h defines PY_UNICODE_TYPE as wchar_t and Py_UNICODE_SIZE as 4, which is not consistent with the docs. It also creates quite a few problems when interfacing Python with other libraries that produce Unicode data.

Is this a bug, or is this behaviour intended? It turns out that at some point in the past this caused problems for tkinter as well, and someone simply changed tkinter's internal Unicode representation to 4 bytes rather than tracking down the real source of the problem.

Is PY_UNICODE_TYPE always going to be guaranteed to be 16 bits, or does it depend on the platform? (In the latter case we can give up now on Python Unicode compatibility with any other libraries.) At the very least, if we can't guarantee the internal representation, then the PyUnicode_FromUnicode API needs to go away and be replaced with something capable of transcoding various Unicode inputs into the internal Python representation.

-- Nick
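
P.S. To make the last point concrete, here is a minimal sketch of the kind of transcoding step I mean, assuming the incoming data is UTF-16 and the internal representation is 4 bytes wide. The name utf16_to_ucs4 and its signature are hypothetical; this is not part of the Python C API, just an illustration of the conversion a caller currently has to do by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Python C API function): widen UTF-16 code
   units into 32-bit code points, combining surrogate pairs.
   Returns the number of code points written, or (size_t)-1 if the input
   contains an unpaired surrogate or dst is too small. */
size_t utf16_to_ucs4(const uint16_t *src, size_t src_len,
                     uint32_t *dst, size_t dst_cap)
{
    size_t i = 0, n = 0;

    assert(src != NULL || src_len == 0);
    assert(dst != NULL || dst_cap == 0);

    while (i < src_len) {
        uint32_t cp = src[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i >= src_len || src[i] < 0xDC00 || src[i] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            return (size_t)-1;
        }

        if (n >= dst_cap)
            return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}
```

A caller holding 16-bit data from another library would run its buffer through something like this before handing it to a build where Py_UNICODE is 4 bytes wide; a build with a 16-bit Py_UNICODE would not need the widening step at all.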