On 12 January 2018 at 14:55, Steve Dower <steve.do...@python.org> wrote:
> On 12Jan2018 0342, Random832 wrote:
>>
>> On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote:
>>>
>>> The way of solving this issue in Python is using an error handler. The
>>> "surrogateescape" error handler is specially designed for lossless
>>> reversible decoding. It maps every unassigned byte in the range
>>> 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows
>>> you to distinguish correctly decoded characters from the escaped bytes,
>>> perform character by character processing of the decoded text, and
>>> encode the result back with the same encoding.
>>
>> Maybe we need a new error handler that maps unassigned bytes in the range
>> 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the
>> encodings being discussed have behavior other than the "normal" version of
>> the encoding plus what I just described?
>
>
> +1 on this being an error handler (if possible). I suspect the semantics
> will be more complex than suggested above, but as this seems to be able
> handling normally un[en/de]codable characters, using an error handler to
> return something more sensible best represents what is going on. Call it
> something like 'web' or 'relaxed' or 'whatwg'.
>
> I don't know if error handlers have enough context for this though. If not,
> we should ensure they can have it. I'd much rather explain one new error
> handler to most people (and a more complex API for implementing them to the
> few people who do it) than explain a whole suite of new encodings.
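
To make the quoted idea concrete, here is a minimal sketch of a decode-only
error handler that passes undecodable bytes in the range 0x80-0x9f through as
U+0080-U+009F. The function and the registered name "controlpass" are
placeholders for illustration, not an existing handler, and the real semantics
may well need to be more involved than this. It relies only on
codecs.register_error and the object/start/end attributes of
UnicodeDecodeError.

```python
import codecs


def controlpass(exc):
    """Hypothetical handler: pass undecodable C1 bytes through unchanged.

    Bytes 0x80-0x9f that the codec cannot decode become U+0080-U+009F.
    Anything else (including all encoding errors) is re-raised.
    """
    # Decode-only: refuse to handle encode or translate errors.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc

    # exc.object is the full input; start/end delimit the offending bytes.
    bad = exc.object[exc.start:exc.end]
    if all(0x80 <= b <= 0x9F for b in bad):
        # Return the replacement text and the position to resume decoding at.
        return "".join(chr(b) for b in bad), exc.end

    # Outside the C1 range: keep the strict behaviour.
    raise exc


# Assumption: "controlpass" is only a placeholder name for this sketch.
codecs.register_error("controlpass", controlpass)

# In Python's cp1252 codec, 0x80 is assigned (the euro sign) but 0x81 is not.
text = b"\x80\x81".decode("cp1252", errors="controlpass")
print(ascii(text))  # '\u20ac\x81'
```

With the default "strict" handler the same input raises UnicodeDecodeError at
the 0x81 byte, so the existing cp1252 codec is left untouched and only the
error path changes.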

+1 from me, which shifts my position to be:

1. If we can make a decoding-only error handler that does the desired
   thing in combination with our existing codecs, let's do that (perhaps
   using a name like "controlpass", since the intent is to pass through
   otherwise unassigned latin-1 control characters, similar to the way
   "surrogatepass" allows lone surrogates)
2. Only if 1 fails for some reason would we look at adding the extra
   decode-only codec variants.

Given the power of error handlers, though, I expect the
surrogatepass-style error handler approach will work (see
https://docs.python.org/3/library/codecs.html#codecs.register_error
and https://docs.python.org/3/library/exceptions.html#UnicodeError for
an overview of the information they're given and what they can do
about it).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/