Should canonicalization be an extra feature of the Text IO, on par with character encoding?
On 6/20/07, Daniel Stutzbach <[EMAIL PROTECTED]> wrote:
> On 6/20/07, Bill Janssen <[EMAIL PROTECTED]> wrote:

[For the TextIO, as opposed to the raw IO, Bill originally proposed
dropping read(n), because character count is not well-defined.  Dan
objected that not all text has useful line breaks.]

> > ... just saying "give me N characters" isn't enough.
> > We need to say, "N characters assuming a text
> > encoding of M, with a normalization policy of Q,
> > and a newline policy of R".

[Daniel points out that TextIO already handles M and R.]

> I'm not sure I 100% understand what you mean by
> "normalization policy" (Q).  Could you give an example?

How many characters are there in ö?  If I ask for just one character,
do I get only the o, without the diaeresis, or do I get both (since
they are linguistically one letter), or does it depend on how some
editor happened to store it?

Distinguishing strings based on an accident of storage would violate
the Unicode standard.  (More precisely, it would be a violation of the
standard to assume that they are distinguished.)

To the extent that you are treating the data as text rather than
binary, NFC or NFD normalization should always be appropriate.  In
practice, though, binary concerns intrude even for text data: you may
well want to save the text back out in its original encoding, without
any spurious changes.

Proposal: open would default to NFC.  import would open source code
with NFKC.  An explicit canonicalization of None would allow
round-trips without spurious binary-level changes.

-jJ
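[To make the ö ambiguity concrete: the two storage forms are distinct
code-point sequences of different lengths until they are normalized.
A quick demonstration with the stdlib unicodedata module:]

    import unicodedata

    nfc = "\u00f6"    # ö as one precomposed code point, U+00F6
    nfd = "o\u0308"   # ö as 'o' plus a combining diaeresis, U+0308

    len(nfc), len(nfd)   # (1, 2) -- "give me 1 character" is ambiguous
    nfc == nfd           # False at the code-point level

    # After normalization, either spelling compares equal:
    unicodedata.normalize("NFC", nfd) == nfc   # True
    unicodedata.normalize("NFD", nfc) == nfd   # True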
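[A rough sketch of how the proposed policy knob might behave, written
as a standalone helper rather than a real TextIO parameter; the name
read_text and its form argument are invented for illustration:]

    import unicodedata

    def read_text(path, encoding="utf-8", form="NFC"):
        # Hypothetical helper, not a proposed API: read a whole text
        # file and apply the requested normalization form.  form=None
        # skips normalization, so the exact code-point sequence
        # round-trips without spurious binary-level changes.
        with open(path, encoding=encoding) as f:
            data = f.read()
        if form is not None:
            data = unicodedata.normalize(form, data)
        return data

[Under this sketch, import-style reading would correspond to
read_text(path, form="NFKC"), and a lossless round-trip to form=None.]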