> So let me explain it. I see two different sequences of code points:
> 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> claim they are equivalent. They are two different sequences of code
> points.
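
Concretely -- my own sketch, using Py3k str literals and names of my
own choosing -- those two sequences are:

    import unicodedata

    precomposed = 'L\u00F6wis'   # 'L', U+00F6 (o with diaeresis), 'w', 'i', 's'
    decomposed = 'Lo\u0308wis'   # 'L', 'o', U+0308 (combining diaeresis), 'w', 'i', 's'

    print(len(precomposed))            # 5
    print(len(decomposed))             # 6
    print(precomposed == decomposed)   # False: different code point sequences

    # Unicode calls them canonically equivalent: both have the same NFC form.
    print(unicodedata.normalize('NFC', precomposed) ==
          unicodedata.normalize('NFC', decomposed))   # True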
If they were sequences of integers, or sequences of bytes, I'd agree
with you. But they are explicitly sequences of characters, not
sequences of code points. There should be one internal normalized form
for strings.

> We should not hide that Python's unicode string object can
> store each sequence of code points equally well, and that when viewed
> as a sequence they are different: the first has len() == 5, the second
> has len() == 6!

We should definitely not expose that difference!

> When read from a file they are different.

A file is in UTF-8, or UTF-16, or whatever -- it contains a string
coerced to a sequence of bits. Whatever reads that file should in fact
either preserve that sequence of bytes (in which case it's not a
string), or coerce it to a Unicode string, in which case the file
representation is immaterial and the Python normalized form is used
internally.

> I might be
> writing either literal with the expectation to get exactly that
> sequence of code points, in order to use it as a test case or as input
> for another program that requires specific input.

In that case you should write it as a sequence of integers, because
that's what you're dealing with.

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

Bad idea, IMO.

Bill
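
P.S. A rough sketch of the "coerce to one internal form" reading path
I have in mind. I'm assuming NFC as the normal form purely for
illustration -- which form Python should actually use is a separate
decision -- and read_text is a hypothetical helper, not a proposed API:

    import unicodedata

    def read_text(path, encoding='utf-8'):
        # If you need the exact on-disk sequence, keep the bytes;
        # a *string* should always come back in one normal form.
        with open(path, 'rb') as f:
            raw = f.read()                  # the file's sequence of bytes
        return unicodedata.normalize('NFC', raw.decode(encoding))

With that, both spellings of 'L\u00F6wis' read back from a file
compare equal and have the same len().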
