On 25.08.2011 11:39, Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
>
>  > No, that's explicitly *not* what C6 says. Instead, it says that a
>  > process that treats s1 and s2 differently shall not assume that
>  > others will do the same, i.e. that it is ok to treat them the same
>  > even though they have different code points. Treating them
>  > differently is also conforming.
>
> Then what requirement does C6 impose, in your opinion?
In IETF terminology, it's a weak SHOULD requirement: unless there are
reasons not to, equivalent strings should be treated the same. It's a
weak requirement because the reasons not to treat them equally are
wide-spread.

>  - Ideally, an implementation would *always* interpret two
>    canonical-equivalent sequences *identically*. There are practical
>    circumstances under which implementations may reasonably
>    distinguish them.
>
> (Emphasis mine.)

Ok, so let me put emphasis on *ideally*. They acknowledge that for
practical reasons, equivalent strings may need to be distinguished.

> The examples given are things like "inspecting memory representation
> structure" (which properly speaking is really outside of Unicode
> conformance) and "ignoring collation behavior of combining sequences
> outside the repertoire of a specified language." That sounds like
> "Special cases aren't special enough to break the rules. Although
> practicality beats purity." to me. Treating things differently is an
> exceptional case that requires sufficient justification.

And the common justification is efficiency, along with the desire to
support the representation of unnormalized strings (without that
desire, an efficient implementation would be possible: just normalize
every string on creation).

> If our process is working with an external process (the OS's file
> system driver) whose definition includes the statement that "File
> names are sequences of Unicode characters", then C6 says our process
> must compare canonically equivalent sequences that it takes to be
> file names as the same, whether or not they are in the same
> normalized form, or normalized at all, because we can't assume the
> file system will treat them as different.

It may well happen that this requirement is met in a plain Python
application. If the file system and the GUI libraries always return
NFD strings, then the Python process *will* compare equivalent
sequences correctly (since it won't ever see any other
representations).

> *Users* will certainly take the viewpoint that two strings that
> display the same on their monitor should identify the same file when
> they use them as file names.

Yes, but that's the operating system's choice first of all. Some
operating systems do allow file names in a single directory that are
equivalent yet use different code points. Python then needs to support
such systems, despite the Unicode standard's permission to ignore the
difference.

> I'm simply saying that the current implementation of strings, as
> improved by PEP 393, can not be said to be conforming.

I continue to disagree: the Unicode standard deliberately allows
Python's behavior as conforming.

> I would like to see something much more conformant done as a separate
> library (the Python Components for Unicode, say), intended to support
> users who need character-based behavior, Unicode-ly correct
> collation, etc., more than efficiency.

Wrt. normalization, I think all that's needed is already there:
applications just need to normalize all strings to a normal form of
their liking, and be done (a short sketch follows below). That's
easier than using a separate library throughout the code base (let
alone using yet another string type).

Regards,
Martin
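
A minimal sketch of the normalize-everything approach mentioned above,
assuming NFC is the chosen form (the helper names here are only
illustrative, not an existing API):

    import os
    import unicodedata

    def nfc(s):
        # Apply one chosen normal form (NFC here) to every string that
        # crosses an application boundary.
        return unicodedata.normalize("NFC", s)

    # U+00E9 vs. U+0065 U+0301: canonically equivalent, but different
    # code point sequences.
    precomposed = "\u00e9"   # e with acute as a single code point
    decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

    print(precomposed == decomposed)            # False: raw code points
    print(nfc(precomposed) == nfc(decomposed))  # True: after normalizing

    # The same idea applied to file names returned by the OS, e.g. from
    # a file system that hands back NFD strings:
    def find_file(directory, wanted):
        wanted = nfc(wanted)
        for name in os.listdir(directory):
            if nfc(name) == wanted:
                return name
        return None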