Paul Moore writes:

> IronPython and Jython can retain UTF-16 as their native form if that
> makes interop cleaner, but in doing so they need to ensure that basic
> operations like indexing and len work in terms of code points, not
> code units, if they are to conform.
[...]

> They lose the O(1) guarantee, but that's easily defensible as a
> tradeoff to conform to underlying runtime semantics.

Unfortunately, I don't think it's all that easy to defend. Absent PEP 393 or a restriction to the characters in the BMP, this is a very expensive change, easily visible to interactive users, let alone performance-hungry applications.

I personally do advocate the "array of code points" definition, but I don't use IronPython or Jython so PEP 393 is as close to heaven as I expect to get. OTOH, I also use Emacsen with Mule, and I have to admit that there is a perceptible performance hit in any large (>1 MB) buffer containing non-ASCII characters vs. pure ASCII (the code unit in Mule is 1 byte).

I expect that if IronPython and Jython really want to retain native, code-unit-based representations, it's going to be painful to conform to an "array of code points" specification. There may need to be a compromise of the form "Implementations SHOULD provide an implementation of str that is both O(1) in indexing and an array of code points. Code that is Unicode-ly correct in Python implementing PEP 393 will need to be ported with some effort to implementations that do not satisfy this requirement, perhaps using different algorithms or extra libraries."
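
To make the cost concrete, here is a minimal sketch of what code-point indexing and len look like on top of a UTF-16 code-unit array. The helper names (utf16_units, iter_code_points, codepoint_len, codepoint_at) are hypothetical and only illustrate the linear scan; this is not a description of how IronPython or Jython implement str.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points from a list of UTF-16 code units, pairing surrogates."""
    pos = 0
    n = len(units)
    while pos < n:
        u = units[pos]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and pos + 1 < n and 0xDC00 <= units[pos + 1] <= 0xDFFF:
            # A surrogate pair encodes one code point outside the BMP.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[pos + 1] - 0xDC00)
            pos += 2
        else:
            # A BMP character (or a lone surrogate, passed through as-is).
            yield u
            pos += 1


def codepoint_len(units):
    """len() in code points: has to scan every code unit."""
    return sum(1 for _ in iter_code_points(units))


def codepoint_at(units, index):
    """Indexing in code points: has to scan from the start up to index."""
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    for i, cp in enumerate(iter_code_points(units)):
        if i == index:
            return cp
    raise IndexError("code point index out of range")


units = utf16_units("a\U0001F600b")
print(len(units), codepoint_len(units), hex(codepoint_at(units, 1)))
```

Both operations walk the code units from the start, so they are O(n) in the length of the string. A fixed-width array of code points, as PEP 393 provides, answers the same questions in O(1).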