On 6/21/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> Should canonicalization be an extra feature of the Text IO, on
> par with character encoding?
>
> On 6/20/07, Daniel Stutzbach <[EMAIL PROTECTED]> wrote:
> > On 6/20/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
>
> [For the TextIO, as opposed to the raw IO, Bill originally proposed
> dropping read(n), because character count is not well-defined. Dan
> objected that not all text has useful line breaks.]
>
> > > ... just saying "give me N characters" isn't enough.
> > > We need to say, "N characters assuming a text
> > > encoding of M, with a normalization policy of Q,
> > > and a newline policy of R".
>
> [ Daniel points out that TextIO already handles M and R ]
>
> > I'm not sure I 100% understand what you mean by
> > "normalization policy" (Q). Could you give an example?
>
> How many characters are there in ö?
>
> If I ask for just one character, do I get only the o, without the
> diaeresis, or do I get both (since they are linguistically one
> letter), or does it depend on how some editor happened to store it?
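[To make Jim's question concrete, here is a minimal sketch using the stdlib unicodedata module; the variable names are just for illustration:]

```python
import unicodedata

# "ö" can be stored two ways: as one precomposed code point, or as
# "o" followed by a combining diaeresis. They are the same "letter"
# but different sequences of code points.
nfc = "\u00f6"    # LATIN SMALL LETTER O WITH DIAERESIS (precomposed)
nfd = "o\u0308"   # "o" + COMBINING DIAERESIS (decomposed)

print(len(nfc))    # 1
print(len(nfd))    # 2
print(nfc == nfd)  # False: str compares code points, not letters

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
print(unicodedata.normalize("NFD", nfc) == nfd)  # True
```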
It should get you the next code unit as it comes out of the
incremental codec. (Did you see the semantic model I described in a
different thread?)

> Distinguishing strings based on an accident of storage would violate
> unicode standards. (More precisely, it would be a violation of
> standards to assume that they are distinguished.)

I don't give a damn about this requirement of the Unicode standard. At
least, I don't think Python should enforce it at the level of the str
data type, and that includes str objects returned by the I/O library.

> To the extent that you are treating the data as text rather than
> binary, NFC or NFD normalization should always be appropriate.
>
> In practice, binary concerns do intrude even for text data; you may
> well want to save it back out in the original encoding, without any
> spurious changes.
>
> Proposal:
>
> open would default to NFC.
>
> import would open source code with NFKC.
>
> An explicit None canonicalization would allow round-trips without
> spurious binary-level changes.

Counter-proposal: normalization is provided as library functionality.
Applications are responsible for normalizing data when they need it to
be normalized and can't be sure that it isn't already normalized. The
source parser used by import and a few other places is an
"application" in this sense and can certainly apply whatever
normalization is required. Have we agreed on the level of
normalization for source code yet? I'm pretty sure we have agreed on
*when* it happens, i.e. (logically) before the lexer starts scanning
the source code.

I would not be against an additional optional layer in the I/O stack
that applies normalization. We could even have an optional parameter
to open() to push this onto the stack. But I don't think it should be
the default.

What is the status of normalization in Java? Does Java source code get
normalized before it is parsed? What if \u.... is used? Do the Java
I/O library classes normalize text?
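[A hypothetical sketch of what such an optional normalization layer might look like, wrapping a text stream; this is not any agreed-upon API, and the class name is made up:]

```python
import io
import unicodedata

class NormalizingReader:
    """Hypothetical optional normalization layer: wraps a text stream
    and applies a Unicode normalization form to whatever the
    underlying stream returns. A sketch, not a proposed stdlib API."""

    def __init__(self, stream, form="NFC"):
        self._stream = stream
        self._form = form

    def read(self, n=-1):
        # Caveat: a real implementation would need to buffer, since a
        # combining character may straddle two read(n) calls -- which
        # is exactly the read(n) ambiguity discussed above.
        return unicodedata.normalize(self._form, self._stream.read(n))

    def readline(self):
        return unicodedata.normalize(self._form, self._stream.readline())

# Decomposed input ("o" + combining diaeresis) comes back precomposed.
reader = NormalizingReader(io.StringIO("o\u0308"), form="NFC")
print(reader.read())  # ö (a single code point)
```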
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com