Nathaniel Smith writes:

> It's also nice to be able to parse some HTML data, make a few changes
> in memory, and then serialize it back to HTML. Having this crash on
> random documents is rather irritating, esp. if these documents are
> standards-compliant HTML as in this case.
This example doesn't make sense to me. Why would *conformant* HTML crash the codec? Unless you're saying the source is non-conformant and *lied* about the encoding? In that case, errors=surrogateescape should do what you want here, no? If not, new codecs won't help you; the "crash" is somewhere else.

Similarly with Soni's use case of control characters for formatting in an IRC client. If they're C0, then AFAICT all of the ASCII-compatible codecs already pass all of those through.[1] If they're C1, then you've got big trouble, because the multibyte encodings will either error on a malformed character or produce an unintended character (except for UTF-8, which can encode any C1 character directly). The windows-* encodings are quite inconsistent about the graphics they put in C1 space, as well as where they leave holes, so this is not just application-specific, it's even encoding-specific behavior.

The more examples of claimed use cases I see, the more I think most of them are already addressed more safely by Python's existing mechanisms, and the less I see a real need for this in the stdlib -- with the single exception that the WHATWG may be a better authority than Microsoft to follow for the windows-* codecs.

Footnotes:
[1] I don't like that much myself; I'd rather restrict passthrough to the controls with universally accepted semantics, including CR, LF, HT, ESC, BEL, and FF. But passthrough is traditional there, a few more are in somewhat common use, and I'm not crazy enough to break backward compatibility.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
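P.S. For what it's worth, here's a quick sketch of the surrogateescape round trip I mean. The sample document is made up; any byte that's invalid in the declared encoding behaves the same way:

```python
# A document that claims UTF-8 but contains a raw latin-1 byte (0xE9).
data = b"<p>caf\xe9</p>"

# surrogateescape smuggles the bad byte through as a lone surrogate
# (U+DC00 + byte value) instead of raising UnicodeDecodeError.
text = data.decode("utf-8", errors="surrogateescape")
assert text == "<p>caf\udce9</p>"

# ... edit the text in memory here ...

# Re-encoding with the same handler restores the original byte exactly,
# so serializing back to "HTML" is a byte-for-byte round trip.
out = text.encode("utf-8", errors="surrogateescape")
assert out == data
```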
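P.P.S. A quick check of the C0-vs-C1 claim. The sample string is invented (\x02 and \x03 are mIRC's bold and color codes), and U+0085 (NEL) stands in for any C1 control:

```python
# C0 controls round-trip untouched through the ASCII-compatible codecs:
msg = "\x02bold\x02 \x0304red"
for enc in ("ascii", "latin-1", "cp1252", "shift_jis", "utf-8"):
    assert msg.encode(enc).decode(enc) == msg

# C1 is another story. UTF-8 can encode U+0085 directly ...
assert "\x85".encode("utf-8") == b"\xc2\x85"

# ... but cp1252 puts a graphic (the ellipsis, U+2026) at byte 0x85
# and has no slot for U+0085 at all, so encoding it simply fails:
try:
    "\x85".encode("cp1252")
except UnicodeEncodeError:
    pass
else:
    raise AssertionError("expected cp1252 to reject U+0085")
```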