On 26 August 2011 03:52, Guido van Rossum <gu...@python.org> wrote:
> I know that by now I am repeating myself, but I think it would be
> really good if we could get rid of this ambiguity. PEP 393 seems the
> best way forward, even if it doesn't directly address what to do for
> IronPython or Jython, both of which have to deal with a pervasive
> native string type that contains UTF-16.

Hmm, I'm completely naive in this area, but from reading the thread, would a possible approach be to say that Python (the language definition) is defined in terms of code points (as we already do, even if the wording might benefit from some clarification)?

Then, under PEP 393, and currently in wide builds, CPython conforms to that definition (and retains the property of basic operations being O(1), which is not in the language definition but is a user expectation and your expressed requirement).

IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but to conform they need to ensure that basic operations like indexing and len work in terms of code points, not code units. Presumably this will be easier than moving to a UCS-4 representation, as they can defer to runtime support routines via interop (which presumably get this right - or at the very least can be blamed for any errors :-)). They lose the O(1) guarantee, but that's easily defensible as a tradeoff to conform to underlying runtime semantics.

Does this make sense, or have I completely misunderstood things?

Paul.

PS Thanks to all for the discussion in general, I'm learning a lot about Unicode from all of this!
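
To make the code point vs code unit distinction concrete, here is a rough sketch of a string type that stores UTF-16 code units internally but answers len and indexing in code points. The class name Utf16Str is made up for illustration; this is not how IronPython or Jython actually implement strings, and it assumes well-formed input (no lone surrogates). Note that both operations scan the units, which is the O(n) cost mentioned above.

```python
import struct


class Utf16Str:
    """Hypothetical wrapper: UTF-16 code units inside, code points outside."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        # Each UTF-16 code unit occupies two bytes.
        self.units = struct.unpack("<%dH" % (len(data) // 2), data)

    def __len__(self):
        # Every code point contributes exactly one unit that is not a
        # low (trailing) surrogate, so count those. This is O(n).
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if index < 0:
            raise IndexError("index out of range")
        i = 0    # position in code units
        pos = 0  # position in code points
        while i < len(self.units):
            unit = self.units[i]
            is_pair = (
                0xD800 <= unit <= 0xDBFF
                and i + 1 < len(self.units)
                and 0xDC00 <= self.units[i + 1] <= 0xDFFF
            )
            if pos == index:
                if is_pair:
                    # Combine the high and low surrogates into one code point.
                    low = self.units[i + 1]
                    return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
                return chr(unit)
            i += 2 if is_pair else 1
            pos += 1
        raise IndexError("index out of range")
```

For a string such as "a\U00010400b", this stores four code units but reports a length of three, and index 1 returns the whole supplementary character rather than half of a surrogate pair.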