On 6/5/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Jim Jewett wrote:
> > On 6/5/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> >> > Always normalizing would have the advantage of simplicity (no
> >> > matter what the encoding, the result is the same), and I think
> >> > that is the real path of least surprise if you sum over all
> >> > surprises.

> >> I'd like to repeat that this is out of scope of this PEP, though.
> >> This PEP doesn't, and shouldn't, specify how string literals get
> >> from source to execution.

> > I see that as a gray area.

> Please read the PEP title again. What is unclear about
> "Supporting Non-ASCII Identifiers"?

That strings can also be used as identifiers.

> > Unicode does say pretty clearly that (at least) canonical equivalents
> > must be treated the same.

> Chapter and verse, please?

I am pretty sure this list is not exhaustive, but it may be helpful:

The Identifiers Annex
http://www.unicode.org/reports/tr31/

"""
UAX31-C2. An implementation claiming conformance to Level 1 of this
specification shall describe which of the following it observes:

R1 Default Identifiers
R2 Alternative Identifiers
R3 Pattern_White_Space and Pattern_Syntax Characters
R4 Normalized Identifiers
R5 Case-Insensitive Identifiers
"""

I interpret this as "If we normalize the Identifiers, then we must
observe R4."

R4 lets us exclude individual characters from normalization, but it
says that two IDs with the same Normalization Form are equivalent,
unless they include specifically excluded characters.

"""
R4 Normalized Identifiers

To meet this requirement, an implementation shall specify the
Normalization Form and shall provide a precise list of any characters
that are excluded from normalization. If the Normalization Form is
NFKC, the implementation shall apply the modifications in Section 5.1,
NFKC Modifications, given by the properties XID_Start and
XID_Continue. Except for identifiers containing excluded characters,
any two identifiers that have the same Normalization Form shall be
treated as equivalent by the implementation.
"""

Additional Support:

The Normalization Annex
http://www.unicode.org/reports/tr15/
near the end of section 1 (but before 1.1)

"""
Normalization Forms KC and KD must not be blindly applied to arbitrary
text.
"""
...
"""
They can be applied more freely to domains with restricted character
sets, such as in Section 13, Programming Language Identifiers.
"""
(section 13 then forwards back to UAX31)

TR 15, section 19, numbered paragraph 3

"""
Higher-level processes that transform or compare strings, or that
perform other higher-level functions, must respect canonical
equivalence or problems will result.
"""

Looking at the main standard, I revert to Unicode 4 because it is
online at http://www.unicode.org/versions/Unicode4.0.0/

2.2 Equivalent Sequences

"""
...
If an application or user attempts to distinguish non-identical
sequences which are nonetheless considered to be equivalent sequences,
as shown in the examples in Figure 2-6, it would not be guaranteed
that other applications or users would recognize the same
distinctions. To prevent introducing interoperability problems between
applications, such distinctions must be avoided wherever possible.
"""

which is echoed in chapter 3 (conformance)

"""
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
...
Ideally, an implementation would always interpret two
canonical-equivalent character sequences identically. There are
practical circumstances under which implementations may reasonably
distinguish them.
"""

"""
C10 When a process purports not to modify the interpretation of a
valid coded character representation, it shall make no change to that
coded character representation other than the possible replacement of
character sequences by their canonical-equivalent sequences or the
deletion of noncharacter code points.
...
All processes and higher-level protocols are required to abide by C10
as a minimum. However, higher-level protocols may define additional
equivalences that do not constitute modifications under that protocol.
For example, a higher-level protocol may allow a sequence of spaces to
be replaced by a single space.
"""

> > In theory, this could be done only to identifiers, but then it needs
> > to be done inline for getattr.

> Why that? The caller of getattr would need to apply normalization in
> case the input isn't known to be normalized?

OK, I suppose that might work, if documented, but ... it seems like
another piece of boilerplate; when it isn't there, it won't really be
because the input is normalized so much as it is because the author
didn't think about normalization.

-jJ
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
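
For illustration only, here is a minimal sketch of the caller-side
boilerplate discussed above. The helper name normalized_getattr, the
Config class, and the choice of NFKC as the default form are all
assumptions made for this example, not something specified in the
thread. It shows two canonically equivalent spellings of one name
comparing unequal as raw strings, and a lookup that succeeds only when
the caller remembers to normalize first.

```python
import unicodedata


def normalized_getattr(obj, name, form="NFKC"):
    """Hypothetical helper: normalize `name` before the attribute lookup.

    The default form is an assumption for this sketch; any of the forms
    accepted by unicodedata.normalize could be passed instead.
    """
    return getattr(obj, unicodedata.normalize(form, name))


class Config:
    """Hypothetical empty class used only to hold an attribute."""


cfg = Config()
composed = "caf\u00e9"       # "é" as a single precomposed code point
decomposed = "cafe\u0301"    # "e" followed by a combining acute accent

# setattr stores the name exactly as given, with no normalization.
setattr(cfg, composed, 42)

# The two spellings are canonically equivalent but differ as raw strings.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# A raw lookup with the decomposed spelling misses the attribute;
# the normalizing helper finds it.
print(hasattr(cfg, decomposed))
print(normalized_getattr(cfg, decomposed))
```

If the call to unicodedata.normalize is left out, nothing fails loudly;
the lookup simply misses, which is the kind of silent omission the
reply above is worried about.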