Guido van Rossum writes:

> > If I ask for just one character, do I get only the o, without the
> > diaeresis, or do I get both (since they are linguistically one
> > letter), or does it depend on how some editor happened to store it?
>
> It should get you the next code unit as it comes out of the
> incremental codec. (Did you see my semantic model I described in a
> different thread?)

I don't like this<wink>, but since that's the way it's gonna be ...

> > Distinguishing strings based on an accident of storage would violate
> > unicode standards. (More precisely, it would be a violation of
> > standards to assume that they are distinguished.)
>
> I don't give a damn about this requirement of the Unicode standard.

... this requirement does not apply to the Python str type as you have described it.

I think at this stage we're asking for trouble to have any normalization by default, even in the TextIO module. str is not text, it's an array of code units. str is going to be used to implement codecs, I/O buffers, all kinds of things that don't necessarily have Unicode text semantics. Unless the Python language itself defines the semantics of the array of code units, EIBTI.

This accords with Martin's statement about identifiers being the only thing he proposed normalizing.

Even if we know a user wants text, I don't see any state of the art that allows us to guess which normalization will be most useful to him. I think for identifiers, NFKC is almost a no-brainer. But for strings it is not at all obvious. NFC violates useful string invariants such as len(a) + len(b) == len(a+b). AFAICS, NKD does not. OTOH, if you don't need strings to obey array invariants, NFC is much more friendly to "dumb" UIs that just display the characters as they get them, without trying to find an equivalent that is in the font for missing characters.

And it seems plausible that some applications will mix normalizations inside of the Python instance. The app must handle this; Python can't. Even if you carry normalization information around with your str object, what normalization is Python supposed to apply to nfd_str + nfc_str? But surely that operation is permissible!

> > In practice, binary concerns do intrude even for text data; you may
> > well want to save it back out in the original encoding, without any
> > spurious changes.

Then for the purposes of this discussion, it's not text, it's binary. In many cases it will need to be read as bytes and stored that way until written back out. I.e., many legacy encodings do not support roundtrips, such as those that use ISO 2022 extension techniques: there's no rule against having a mode-changing sequence and its inverse in succession, and it's occasionally seen in the wild.

Even UTF-8 has unnormalized representations for many characters, and it was only recently that Unicode came to require that they be treated as errors, and not interpreted (producing them has always been forbidden).

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
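
For what it's worth, the point about NFC and len(a) + len(b) == len(a+b) can be checked with the standard unicodedata module. This is only a minimal sketch, assuming a Python 3 str that holds code points; the helper names nfc and nfd are mine and purely illustrative, not anything proposed in this thread.

```python
import unicodedata


def nfc(s: str) -> str:
    """Illustrative helper: return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    """Illustrative helper: return s in Normalization Form D."""
    return unicodedata.normalize("NFD", s)


# Two strings that are each already in NFC (and in NFD) on their own.
a = "o"        # LATIN SMALL LETTER O
b = "\u0308"   # COMBINING DIAERESIS
assert nfc(a) == a and nfc(b) == b

# Plain str concatenation obeys the array invariant.
assert len(a) + len(b) == len(a + b) == 2

# If the result is kept in NFC, the two code points compose into one,
# so the length invariant no longer holds for "NFC strings".
joined_nfc = nfc(a + b)
assert joined_nfc == "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
assert len(joined_nfc) == 1

# Under NFD nothing composes in this case, so the length is preserved.
joined_nfd = nfd(a + b)
assert len(joined_nfd) == 2

# Mixing normalizations: the concatenation is permissible, but the
# result is in neither form until the application picks one.
nfd_str = "o\u0308"
nfc_str = "\u00f6"
mixed = nfd_str + nfc_str
assert len(mixed) == 3
assert nfc(mixed) != mixed and nfd(mixed) != mixed
```

Nothing here says which form an application should choose; it only shows that the choice changes observable string lengths, which is why it seems safer to leave it to the app.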