>> > Unicode does say pretty clearly that (at least) canonical equivalents
>> > must be treated the same.
>
>> Chapter and verse, please?
>
> I am pretty sure this list is not exhaustive, but it may be helpful:
>
> The Identifiers Annex http://www.unicode.org/reports/tr31/

Ah, that's in the context of identifiers, not in the context of text
in general.

> """
> UAX31-C2. An implementation claiming conformance to Level 1 of this
> specification shall describe which of the following it observes:
>
> R1 Default Identifiers
> R2 Alternative Identifiers
> R3 Pattern_White_Space and Pattern_Syntax Characters
> R4 Normalized Identifiers
> R5 Case-Insensitive Identifiers
> """
>
> I interpret this as "If we normalize the Identifiers, then we must
> observe R4." R4 lets us exclude individual characters from
> normalization, but it says that two IDs with the same Normalization
> Form are equivalent, unless they include specifically excluded
> characters.

Correct, and that's indeed what PEP 3131 does.

> """
> Normalization Forms KC and KD must not be blindly applied to arbitrary
> text.
> """ ... """
> They can be applied more freely to domains with restricted character
> sets, such as in Section 13, Programming Language Identifiers.
> """
> (section 13 then forwards back to UAX31)

How is that a requirement that comparison should apply normalization?

> TR 15, section 19, numbered paragraph 3
> """
> Higher-level processes that transform or compare strings, or that
> perform other higher-level functions, must respect canonical
> equivalence or problems will result.
> """

That's not a mandatory requirement, but an "important aspect". Also,
it applies to "higher-level processes"; I would expect that string
comparison is not a higher-level function. Indeed, UAX#15 only gives
definitions, no rules.

> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

Right. What is "a process"?

> ...
> Ideally, an implementation would always interpret two
> canonical-equivalent character sequences identically. There are
> practical circumstances under which implementations may reasonably
> distinguish them.
> """

So it should be the application's choice.
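
To make "the application's choice" concrete, here is a minimal sketch,
assuming Python 3 and the standard unicodedata module. The helper name
canonically_equal is hypothetical, made up for illustration; it is not
part of Python or of any PEP. Plain string comparison works code point
by code point, so an application that wants canonical equivalents to
compare equal has to normalize explicitly:

```python
import unicodedata

# Two canonical-equivalent spellings of the same character:
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings under canonical equivalence.

    Both sides are brought to NFC first; NFD would work equally well,
    as long as the same form is used for both operands.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Plain comparison looks at code points only, so the two spellings differ.
plain_result = composed == decomposed

# An application that cares about canonical equivalence normalizes first.
normalized_result = canonically_equal(composed, decomposed)

print(plain_result, normalized_result)
```

The cost of the two normalize() calls is paid only by the code that
asks for it, which is the trade-off being argued here.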

> """
> C10 When a process purports not to modify the interpretation of a
> valid coded character representation, it shall make no change to that
> coded character representation other than the possible replacement of
> character sequences by their canonical-equivalent sequences or the
> deletion of noncharacter code points.
> ...
> All processes and higher-level protocols are required to abide by C10
> as a minimum. However, higher-level protocols may define additional
> equivalences that do not constitute modifications under that protocol.
> For example, a higher-level protocol may allow a sequence of spaces to
> be replaced by a single space.
> """

So this *allows* to canonicalize strings, it doesn't *require* Python
to do so. Indeed, doing so would be fairly expensive, and therefore
it should not be done (IMO).

>> Why that? The caller of getattr would need to apply normalization in
>> case the input isn't known to be normalized?
>
> OK, I suppose that might work, if documented, but ... it seems like
> another piece of boilerplate; when it isn't there, it won't really be
> because the input is normalized so after as it is because the author
> didn't think about normalization.

No. It might also be because the author *knows* that the string is
already normalized.

Regards,
Martin
