> The path of least surprise for legacy encodings might be for
> the codecs to produce whatever is closest to the original encoding
> if possible. I.e. what was one code point would remain one code
> point, and if that's not possible then normalize. I don't know if
> this is any different from always normalizing (it certainly is
> the same for Latin-1).
Depends on the normalization form. For Latin-1, the straightforward
codec produces output that is not in NFKC, as MICRO SIGN gets
normalized to GREEK SMALL LETTER MU. The output is, however, normalized
under NFC. I'm not sure about other codecs; for the CJK ones, I would
expect to see all sorts of issues.

> Always normalizing would have the advantage of simplicity (no matter
> what the encoding, the result is the same), and I think that is
> the real path of least surprise if you sum over all surprises.

I'd like to repeat that this is out of scope of this PEP, though. This
PEP doesn't, and shouldn't, specify how string literals get from source
to execution.

> FWIW, I looked at what Java and XML 1.1 do, and they *don't* normalize
> for some reason.

For XML, I believe the reason is performance. It is *fairly* expensive
to compute NFC in the general case, and I'm not yet certain what a good
way would be to reduce that cost in the "common case" (i.e. when the
data is already in NFC). For XML, imposing this performance hit on top
of the already costly processing of XML would be unacceptable.

Regards,
Martin
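[To make the MICRO SIGN case and the "already in NFC" quick check
concrete, here is a small sketch using Python's unicodedata module;
note that unicodedata.is_normalized is newer than this thread (it was
added in Python 3.8):]

    import unicodedata

    # Latin-1 byte 0xB5 decodes to U+00B5 MICRO SIGN.
    micro = b"\xb5".decode("latin-1")
    assert micro == "\u00b5"

    # NFC leaves MICRO SIGN alone, so the Latin-1 output is NFC-normalized...
    assert unicodedata.normalize("NFC", micro) == micro

    # ...but NFKC folds it into U+03BC GREEK SMALL LETTER MU,
    # so the same output is not NFKC-normalized.
    assert unicodedata.normalize("NFKC", micro) == "\u03bc"

    # The common case "data is already in NFC" can be detected cheaply
    # via the quick-check property, exposed since Python 3.8:
    assert unicodedata.is_normalized("NFC", micro)
    assert not unicodedata.is_normalized("NFKC", micro)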