Steven D'Aprano writes: [long example]
> Am I right so far?
>
> So the email package uses the surrogate-escape error handler and ends up
> with this Unicode string:
>
> 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
>
> which can be encoded back to the bytes we started with.

Yes.

> Note that technically those three \u... code points are NOT classified
> as "noncharacters".

Very unpythonic terminology, easily confusing the nonspecialist.  Or
the specialist -- I used to know that Unicode gave "noncharacter" a
technical definition but it seems I forgot.  But then, Unicode isn't a
PSF product, so I guess it's OK to be unpythonic.<wink/>

> They are actually surrogate code points:
>
> http://www.unicode.org/faq/private_use.html#nonchar4
> http://www.unicode.org/glossary/#surrogate_code_point
>
> and they're supposed to be reserved for UTF-16. I'm not sure of the
> implication of that.

It means that any Python program that invokes the surrogateescape
handler is not a "conforming Unicode process", at least not on the
naive interpretation of that definition.  A conforming process would
interpret them as corrupt characters and raise as soon as detected.

A more sophisticated interpretation might argue that Python is
multiple processes (in the sense of "process" used by Unicode), and
that the Unicode standard only applies to characters.  This is
especially true of Pythons implementing PEP 393, since no surrogates
should ever appear in text[1] at all.  Then the smuggled bytes can be
treated as noncharacters in practice although technically it's a
violation of the Unicode standard to do so.

Footnotes:
[1]  Meaning, no fair using chr() to inject them into str!
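
For concreteness, here is a minimal sketch of the round trip discussed above. It is an illustration under stated assumptions, not the email package's actual code path: the byte string `raw` is hypothetical, picked only so that decoding reproduces the three escaped code points in the quoted string (the trailing curly quote is left out), and the choice of the 'ascii' codec and all variable names are mine.

```python
import unicodedata

# Hypothetical undecodable header bytes, chosen so that decoding yields
# the three escaped code points quoted above (trailing curly quote omitted).
raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!'

# surrogateescape smuggles each undecodable byte 0xNN into the str
# as the lone surrogate code point U+DCNN.
text = raw.decode('ascii', errors='surrogateescape')

# 'Subject: ' is 9 characters, so the smuggled bytes sit at indices 9..11.
escaped = text[9:12]

# Unicode general category 'Cs' means "surrogate", not "noncharacter".
categories = [unicodedata.category(c) for c in escaped]

# Encoding with the same handler recovers the original bytes.
round_trip = text.encode('ascii', errors='surrogateescape')

# A strict encoder refuses lone surrogates, which is roughly what a
# conforming process is expected to do with them.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

With these inputs, `round_trip` should compare equal to `raw`, and `strict_ok` should end up False because the strict UTF-8 encoder rejects the surrogates.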