On 9/25/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
> gabor <[EMAIL PROTECTED]> wrote:
> > Martin v. Löwis wrote:
> > > Gábor Farkas wrote:
> > >> should he write his own slicing/whatever functions to get consistent
> > >> behaviour on linux/windows?

> > now, for this to behave correctly on non-bmp characters, i will need to
> > write a custom function, correct?

As David Hopwood pointed out, to be fully correct, you already have to
create a custom function even with BMP characters, because of decomposed
characters. (Example: representing a c-cedilla as a c and a combining
cedilla, rather than as a single code point.) Separating those two would
be wrong. Counting them as two characters for slicing purposes would
usually be wrong.

Even 32-bit representations are permitted to use surrogate pairs; it
just doesn't often make sense.

These are problems inherent to Unicode (or at least to non-normalized
Unicode). Different Python implementations may expose the problem in
different places, but the problem is always there.

We *could* specify that slicing and indexing act as though the underlying
representation were normalized (and this would typically require
normalization as part of construction), but I'm not sure that is the
right answer. Even if it were trivial, there are reasons not to normalize.

> It is important, arguably one of the most important pieces. But there
> are three parts; 1) code points not currently defined within the unicode
> spec, but who have specific encodings (based on the code point value),
> 2) in the case of UTF-16 representations, Python's handling of
> characters > 65535, 3) surrogates.

> I believe #1 is handled "correctly" today, Martin sounds like he wants
> #2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and
> #3 could be fixed while fixing #2 with a little more work (if desired).

You also left out (4), decomposed characters, which is a more complex
version of surrogates.

Guido just stated that #2 is intentional, though he didn't pronounce
that it should stay that way. There are sound arguments both ways.
In particular, fixing it without fixing decomposed characters might incur
the cost without the benefit.

-jJ
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
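
For illustration only, here is a minimal sketch of the decomposed-character
and surrogate issues discussed in the message above. It assumes a current
Python 3 interpreter and the standard unicodedata module, neither of which
the thread itself names. The helper combining_clusters is a hypothetical
name introduced for this example, and it is a simplification: it only
attaches code points with a nonzero canonical combining class to the
preceding base, which is not full grapheme cluster segmentation.

```python
import unicodedata

# Two spellings of c-cedilla.
composed = "\u00e7"      # single code point: LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

# The spellings differ in length and do not compare equal as-is.
lengths = (len(composed), len(decomposed))
equal_raw = composed == decomposed

# Naive slicing separates the base letter from its combining mark.
naive_first = decomposed[:1]

# Normalizing to NFC composes the pair; NFD decomposes it again.
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed


def combining_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration. It relies on
    unicodedata.combining(), which returns 0 for non-combining code points,
    so marks whose combining class is 0 are not handled.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# A non-BMP character needs two 16-bit code units (a surrogate pair) in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
```

Slicing the list returned by combining_clusters, rather than the string
itself, keeps a base letter and its combining marks together. Whether that
kind of behaviour belongs in the string type or in a library function is
exactly the open question in the thread.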