New submission from Daniel Stutzbach <[email protected]>:
Currently, Python can be built with an internal Unicode representation of UCS2
or UCS4. To prevent extension modules compiled with the wrong Unicode
representation from linking, unicodeobject.h uses #define to rename many of the
Unicode functions. For example, PyUnicode_FromString becomes either
PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString, depending on the build.
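The renaming trick can be illustrated outside of CPython. The following is only
a minimal sketch: DEMO_UNICODE_SIZE, DemoUnicode_FromString, and
demo_linked_symbol are hypothetical stand-ins, not names from unicodeobject.h,
and the function body is a placeholder. It shows how a single source-level name
ends up as a width-specific linker symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical build switch standing in for the interpreter's configured
   Unicode width (2 or 4 bytes per code unit). */
#ifndef DEMO_UNICODE_SIZE
#define DEMO_UNICODE_SIZE 4
#endif

/* Callers always write DemoUnicode_FromString; the preprocessor rewrites it
   to a name that encodes the width, so that is what the linker sees. */
#if DEMO_UNICODE_SIZE == 4
#define DemoUnicode_FromString DemoUnicodeUCS4_FromString
#else
#define DemoUnicode_FromString DemoUnicodeUCS2_FromString
#endif

/* Placeholder body: returns the bytes needed to store the text at one
   code unit per input byte. Only the symbol name matters here. */
size_t DemoUnicode_FromString(const char *s)
{
    assert(s != NULL);
    return strlen(s) * DEMO_UNICODE_SIZE;
}

/* Two-step stringification so the macro is expanded before being quoted. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* Reports the symbol name that calls to DemoUnicode_FromString bind to. */
const char *demo_linked_symbol(void)
{
    return DEMO_STR(DemoUnicode_FromString);
}
```

An extension compiled with the other value of DEMO_UNICODE_SIZE would reference
the other symbol, which the library does not export, so the link fails instead
of the program misbehaving at run time.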
Consequently, if one installs a binary egg (e.g., with easy_install), there's a
good chance one will get an error such as the following when trying to use it:
    undefined symbol: PyUnicodeUCS2_FromString
In Python 2, only some extension modules were stung by this problem. For
Python 3, virtually every extension type will need to call a PyUnicode_*
function, since __repr__ must return a Unicode object. It's basically
fruitless to upload a binary egg for Python 3 to PyPI, since it will generate
link errors for a large fraction of downloaders (I discovered this the hard
way).
Right now, nearly all the functions in unicodeobject.h are wrapped. Several
functions are not. Many of the unwrapped functions also have no documentation,
so I'm guessing they are newer functions that were not wrapped when they were
added.
Most extensions treat PyUnicodeObjects as opaque and do not care if the
internal representation is UCS2 or UCS4. We can improve ABI compatibility by
only wrapping functions where the representation matters from the caller's
point of view.
For example, PyUnicode_FromUnicode creates a Unicode object from an array of
Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so
the function should be wrapped.
On the other hand, PyUnicode_FromString creates a Unicode object from a char *.
The caller can treat the returned object as opaque, so the function should not
be wrapped.
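As a rough illustration of the rule, here is a sketch of a header that wraps
only the representation-dependent entry point. All names (SKETCH_UNICODE_SIZE,
Sketch_UNICODE, SketchUnicode_*) are hypothetical stand-ins rather than the
contents of the attached patch, and the function bodies are placeholders that
return byte counts instead of objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical build switch: bytes per code unit (2 or 4). */
#ifndef SKETCH_UNICODE_SIZE
#define SKETCH_UNICODE_SIZE 4
#endif

#if SKETCH_UNICODE_SIZE == 4
typedef uint32_t Sketch_UNICODE;
/* Non-opaque: the caller passes code units, so the name encodes the width. */
#define SketchUnicode_FromUnicode SketchUnicodeUCS4_FromUnicode
#else
typedef uint16_t Sketch_UNICODE;
#define SketchUnicode_FromUnicode SketchUnicodeUCS2_FromUnicode
#endif

/* Wrapped: the meaning of the input array depends on sizeof(Sketch_UNICODE). */
size_t SketchUnicode_FromUnicode(const Sketch_UNICODE *units, size_t n)
{
    assert(units != NULL);
    return n * sizeof(Sketch_UNICODE);
}

/* Not wrapped: the input is a plain char *, so the width never reaches
   the caller and a single symbol serves both builds. */
size_t SketchUnicode_FromString(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Two-step stringification so macros are expanded before being quoted. */
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x) SKETCH_STR2(x)

const char *sketch_wrapped_symbol(void)
{
    return SKETCH_STR(SketchUnicode_FromUnicode);
}

const char *sketch_unwrapped_symbol(void)
{
    return SKETCH_STR(SketchUnicode_FromString);
}
```

In this sketch, an extension that calls only SketchUnicode_FromString links
against either build, while one that calls SketchUnicode_FromUnicode still gets
a link error on a mismatched build.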
The attached patch implements that rule. It unwraps 64 opaque functions that
were previously wrapped, and wraps 11 non-opaque functions that were previously
unwrapped. "make test" works with both UCS2 and UCS4 builds.
I previously brought this issue up on python-ideas; see:
http://mail.python.org/pipermail/python-ideas/2009-November/006543.html
Here's a summary of that discussion:
Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his
proposal to standardize on UCS4, to reduce the risk of extension modules built
with a mismatched encoding.
Stefan Behnel pointed out that easy_install should allow eggs to specify the
encoding they require. PJE's proposed implementation of that feature
(http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care".
My proposal greatly increases the number of eggs that could label themselves
"Don't Care", reducing maintenance work for package maintainers. In other
words, they are complementary fixes.
Guido liked the idea but expressed concern about the possibility of extension
modules that link successfully, but later crash because they actually do depend
on the UCS2/UCS4 distinction.
With my current patch, there are still two ways for that to happen:
1) The extension uses only opaque functions, but casts the returned PyObject *
to PyUnicodeObject * and accesses the str member, or
2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE
or PyUnicode_AS_DATA macros.
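Both cases come down to reading the character buffer with a code unit width
that differs from the one the interpreter used to write it. The sketch below
demonstrates that effect on a raw buffer. The helper names demo_read_ucs2 and
demo_read_ucs4 are hypothetical and do not touch PyUnicodeObject itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read code unit i from a raw buffer, assuming 2-byte code units.
   memcpy avoids alignment and aliasing problems. */
uint32_t demo_read_ucs2(const void *buf, size_t i)
{
    uint16_t unit;
    assert(buf != NULL);
    memcpy(&unit, (const unsigned char *)buf + i * sizeof unit, sizeof unit);
    return unit;
}

/* Read code unit i from the same kind of buffer, assuming 4-byte code units.
   The caller must ensure the buffer holds at least (i + 1) * 4 bytes. */
uint32_t demo_read_ucs4(const void *buf, size_t i)
{
    uint32_t unit;
    assert(buf != NULL);
    memcpy(&unit, (const unsigned char *)buf + i * sizeof unit, sizeof unit);
    return unit;
}
```

For a buffer holding the two 16-bit units 0x68 and 0x69 ("hi"),
demo_read_ucs2(buf, 0) returns 0x68, whereas demo_read_ucs4(buf, 0) fuses both
units into one value that is not a character from the original text. No linker
error occurs in this situation, which is why it is the remaining risk.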
Most packages that poke into the internals of PyUnicodeObject also call
non-opaque functions. Consequently, they will still generate a linker error if
the encoding is mismatched, as desired.
I'm trying to come up with a way to 100% guarantee that any extension poking
into the internals will generate a linker error if the encoding is mismatched,
even if they don't call any non-opaque functions. I'll post about that in a
separate comment to this bug.
----------
assignee: stutzbach
components: Interpreter Core, Unicode
messages: 105222
nosy: stutzbach
priority: normal
severity: normal
stage: needs patch
status: open
title: Improve ABI compatibility between UCS2 and UCS4 builds
type: behavior
versions: Python 3.2
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue8654>
_______________________________________
_______________________________________________
Python-bugs-list mailing list