I don't really understand what you're doing when you take a fragment of my sentence where I explain a wrong understanding of WHATWG encodings, and say "that's wrong, as you explain". I know it's wrong. That's what I was saying.
You quoted the part where I said "Filling in all the gaps with Latin-1", cut out the part where I said "is wrong", and replied with "that's wrong". I guess I'm glad we're in agreement, but this has been a strange bit of discourse.

In this pseudocode that implements a "whatwg_error_mode", can you describe what the Python code to call it would look like? Does every call to .encode and .decode now have a "whatwg_error_mode" parameter, in addition to the "errors" parameter? Or are there twice as many possible strings you could pass as the "errors" parameter, so you can have "replace", "replace-whatwg", "surrogateescape", "surrogateescape-whatwg", etc?

My objection here isn't efficiency, it's adding confusing extra options to .encode() and .decode() that aren't relevant in most cases.

I'd like to limit this proposal to single-byte encodings, addressing the discrepancies in the C1 characters and possibly that Hebrew vowel point. If there are differences in the JIS encodings, that is a can of worms I'd like to not open at the moment.

-- Rob Speer

On Mon, 22 Jan 2018 at 01:43 Stephen J. Turnbull <turnbull.stephen...@u.tsukuba.ac.jp> wrote:

> I don't expect to change your mind about the "right" way to deal with
> this, but this is a more explicit description of what those of us who
> advocate error handlers are thinking about. It may be useful in
> writing your PEP (PEPs describe rejected counterproposals and
> amendments along with adopted proposals and rationale in either case).
>
> Rob Speer writes:
>
> > > The question to my mind is whether or not this "latin1replace"
> > > handler, in conjunction with existing codecs, will do the same
> > > thing as the WHATWG codecs. If I have understood you correctly,
> > > I think it will. Have I missed something?
> >
> > It won't do the same thing, and neither will the "chaining coders"
> > proposal.
>
> The "chaining coders" proposal isn't well-enough specified to be sure.
>
> However, for practical purposes you may think of a Python *codec* as a
> "whole array" decoder/encoder, and an *error handler* as a
> "token-by-token" decoder/encoder. The distinction in type is for
> efficiency, of course. Codecs can't be "chained" (I think, but I didn't
> think very hard), but handlers can, in the sense that each handler can
> handle some input values and delegate anything it can't deal with to
> the next handler in the chain (under the hood handler implementations
> are just Python functions with a particular signature, so this is just
> "loop until non-None").
>
> > It's easy to miss details like this in all the counterproposals.
>
> I see no reason why a 'whatwgreplace' error handler with the logic
>
>     # I am assuming decoding, and single-byte encodings. Encoding
>     # with 'html' error mode would insert format("&#%d;", ord(unicode)).
>     # Multibyte is a little harder.
>
>     # ASCII bytes never error except maybe in UTF16, UTF32, Shift JIS
>     # and Big5.
>     assert the_byte >= 0x80
>     # Handle C1 control characters.
>     if the_byte < 0xA0:
>         append_to_output(chr(the_byte))
>     # Handle extended repertoire with a dict.
>     # This condition will depend on the particular codec.
>     elif the_byte in additional_code_points:
>         append_to_output(additional_code_points[the_byte])
>     # Implement WHATWG error modes.
>     elif whatwg_error_mode is replacement:
>         append_to_output("\uFFFD")
>     else:
>         raise
>
> doesn't have the effect you want. This can be done in pure Python.
> (Note: The actions in the pseudocode are not accurate. IIRC real
> handlers take a UnicodeError as argument, and return a tuple of the
> text to append to output and number of input tokens to skip, or
> return None to indicate an unhandled error, rather than doing the
> appending and raising themselves.)
>
> The main objection to doing it this way would be efficiency.
> To be honest, I personally don't think that's an important objection
> since this handler is invoked frequently only if the source text is
> badly broken. (Remember, you'll already be greatly expanding the
> repertoire of at least ASCII and ISO 8859/1 by promoting to
> windows-1252.) And it would surely be "fast enough" if written in C.
>
> Caveat: I'm not sure I agree with MAL about windows-1255. I think
> it's arguable that the WHAT-WG index is a better approximation to
> reality, and I'd like to hear Hebrew speakers argue about that (I'm
> not one).
>
> > The difference between WHATWG encodings and the ones in Python is,
> > in all but one case, *only* in the C1 control character range (0x80
> > to 0x9F),
>
> Also in Japanese, where "corporate characters" have been added
> (frequently twice, preventing round-tripping ... yuck) to the JIS
> standard. I haven't checked the Chinese and Korean tables for similar
> damage, but they're not quite as wacky about this stuff as the JISC
> is, so they're probably OK (and of course Big5 was "corporate" from
> the get-go).
>
> > a range of Unicode characters that has historically evaded
> > standardization because they never had a clear purpose even before
> > Unicode. Filling in all the gaps with Latin-1
>
> That's wrong, as you explain:
>
> > [Eg, in Greek, some code points] are simply unassigned. Other
> > software sometimes maps them to the Private Use Area, but this is
> > not standardized at all, and it seems clear that Python should
> > handle them with its usual error handler for unassigned
> > bytes. (Which is one of the reasons not to replace the error
> > handler with something different: we still need the error handler.)
>
> The logic above handles all this.
> As mentioned, a stdlib error handler ('strict', 'replace', or
> 'xmlcharrefreplace' for WHAT-WG conformance, or 'surrogateescape' for
> the Pythonic equivalent of mapping to the private area) could be
> chained if desired, and the defaults could be changed and the names
> aliased to the WHAT-WG terms.
>
> This could be automated with a factory function that takes a list of
> predefined handlers and composes them, although that would add another
> layer of inefficiency (the composition would presumably be done in a
> loop, and possibly using try although I think the error handler
> convention is to return the text to insert if handled, and None if the
> error can't be handled).
>
> Steve
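
For readers following the thread, the handler logic and the "factory function" idea quoted above might look roughly like this against Python's real error-handler API. This is a minimal sketch under assumptions, not anything proposed in the thread: the names c1_passthrough, chain_handlers, "whatwg-c1-strict" and "whatwg-c1-replace" are made up for illustration, only decoding of single-byte encodings is covered, and the per-codec dict of additional code points is left out. Handlers registered with codecs.register_error receive the UnicodeError and return a tuple of the replacement text and the position at which to resume; returning None to delegate is a convention of this sketch only, since stock handlers raise instead.

```python
import codecs


def c1_passthrough(exc):
    """Map an undecodable byte in 0x80..0x9F to the same-numbered C1 control.

    Returns (text, resume_position) when handled. Returning None to
    delegate is this sketch's own convention, not the stdlib's.
    """
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        if 0x80 <= byte < 0xA0:
            return chr(byte), exc.start + 1
    return None


def chain_handlers(*handlers):
    """Hypothetical factory: try each handler in order until one handles it."""
    def composed(exc):
        for handler in handlers:
            result = handler(exc)
            if result is not None:
                return result
        # Nobody handled the error, so behave like 'strict'.
        raise exc
    return composed


# Stock handlers never return None: they either raise or return a tuple,
# so they work as the last link in the chain.
codecs.register_error(
    "whatwg-c1-strict",
    chain_handlers(c1_passthrough, codecs.lookup_error("strict")),
)
codecs.register_error(
    "whatwg-c1-replace",
    chain_handlers(c1_passthrough, codecs.lookup_error("replace")),
)

# The call site only uses the existing `errors` parameter.
text = b"caf\xe9 \x81".decode("cp1252", errors="whatwg-c1-strict")
```

Under this sketch no new parameter is added to .encode() or .decode(), but each combination of C1 passthrough and fallback behaviour does need its own registered name, which is the trade-off raised at the top of this message.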
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/