On 6/3/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Sure - but how can Python tell whether a non-normalized string was
> intentionally put into the source, or as a side effect of the editor
> modifying it?

It can't, but does it really need to? It could always assume the latter.

> In most cases, it won't matter. If it does, it should be explicit
> in the code, e.g. by putting an n() function around the string
> literal.

This is only almost true. Consider these two hypothetical files written
by naive newbies:

data.py:

    favorite_colors = {'Martin Löwis': 'blue'}

code.py:

    import data
    print data.favorite_colors['Martin Löwis']

Now if these are written by two different people using different
editors, one might be normalized in a different way than the other, and
the code would look all right but mysteriously fail to work. Even more
mysteriously, when the files are opened and saved (possibly even
automatically) by one of the people without any changes, the code would
then start to work. And magically break again when the other person
edits one of the files.

The most important thing about normalization is that it should be
consistent for internal strings. Similarly when reading in a text file,
you really should normalize it first, if you're going to handle it as
*text*, not binary.

The most common normalization is NFC, because it works best everywhere
and causes the least amount of surprise. E.g. "Löwis"[2] results in "w",
not in u'\u0308' (COMBINING DIAERESIS), which most naive users won't
expect.

> Also, there is still room for subtle issues, e.g. when concatenating
> two normalized strings will produce a string that isn't normalized.

Sure:

    >>> from unicodedata import normalize as n
    >>> a=n('NFD', u'ö'); n('NFC', a[0])+n('NFC', a[1:]) == n('NFC', a)
    False

But a partial solution is better than no solution.

> Also, in many cases, strings come from IO, not from source, so if
> it is important that they are in NFC, you need to normalize anyway.

Indeed, and it would be best if this happened automatically, like
handling of line endings. It doesn't need to always work, just most of
the time.

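To make the dictionary-lookup failure above concrete, here is a minimal
sketch. It is written in Python 3 syntax so it runs today (the examples
in this message are Python 2.5), and the helper name nfc() is
hypothetical, standing in for the n() idea quoted above. The two
spellings of the name are written with explicit escapes so that no
editor can silently change them.

```python
import unicodedata


def nfc(text):
    """Return text in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", text)


# The same visible name, as two different editors might save it.
name_composed = "Martin L\u00f6wis"     # 'ö' as a single code point (NFC)
name_decomposed = "Martin Lo\u0308wis"  # 'o' + COMBINING DIAERESIS (NFD)

favorite_colors = {name_composed: "blue"}

# The raw lookup fails even though both strings render identically.
found_raw = name_decomposed in favorite_colors

# Normalizing at the boundary (source, file, or other input) fixes it.
found_normalized = nfc(name_decomposed) in favorite_colors

# Indexing differs as well: NFC yields the letter, NFD the combining mark.
third_nfc = nfc("Lo\u0308wis")[2]
third_nfd = unicodedata.normalize("NFD", "L\u00f6wis")[2]
```

Normalizing once, where text enters the program, keeps the rest of the
code free of such checks; this is only an illustration of that idea, not
a proposal for how the interpreter itself should do it.
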
I haven't read the description of Python's syntax, but this happens with
Python 2.5:

test.py:

    a = """
    """
    print repr(a)

Output:

    '\n'

The line ending there is '\r\n', and Python normalizes it when reading
in the source code, even though '\r\n' matters even less than doing NFC
normalization.

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000