OK, I will tune out this conversation. It is clearly not going anywhere.

On Fri, Jan 19, 2018 at 9:12 AM, Rob Speer <rsp...@luminoso.com> wrote:
> Error handlers are quite orthogonal to this problem. If you try to solve
> this problem with an error handler, you will have a different problem.
>
> Suppose you made "c1-control-passthrough" or whatever into an error
> handler, similar to "replace" or "ignore", and then you encounter an
> unassigned character that's *not* in the range 0x80 to 0x9f. (Many
> encodings have these.) Do you replace it? Do you ignore it? You don't
> know, because you just replaced the error handler with something that's
> not about error handling.
>
> I will also repeat that having these encodings (in both directions) will
> provide more ways for Python to *reduce* the amount of mojibake that
> exists. If acknowledging that mojibake exists offends your sense of
> purity, and you'd rather just destroy all mojibake at the source...
> that's great, and please get back to me after you've fixed Microsoft
> Excel.
>
> I hope to make a pull request shortly that implements these mappings as
> new encodings that work just like the other ones.
>
> On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg <m...@egenix.com> wrote:
>
>> On 19.01.2018 17:20, Guido van Rossum wrote:
>>> On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <m...@egenix.com> wrote:
>>>> On 19.01.2018 05:38, Nathaniel Smith wrote:
>>>>> On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <gu...@python.org>
>>>>> wrote:
>>>>>> Can someone explain to me why this is such a controversial issue?
>>>>>
>>>>> I guess practicality versus purity is always controversial :-)
>>>>>
>>>>>> It seems reasonable to me to add new encodings to the stdlib that do
>>>>>> the roundtripping requested in the first message of the thread. As
>>>>>> long as they have new names that seems to fall under "practicality
>>>>>> beats purity".
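[Editor's note: Rob's ambiguity argument can be made concrete with a sketch. The handler below is hypothetical (the name and logic are illustrative, not code from any actual proposal): it passes bytes in the C1 range 0x80-0x9f through as the matching code points, but for any other undecodable byte there is no principled answer -- replace? ignore? -- so it can only re-raise.]

```python
import codecs

def c1_control_passthrough(exc):
    # Hypothetical decode-error handler: map undecodable bytes in the
    # C1 range 0x80-0x9f to the code points U+0080-U+009F.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b <= 0x9F for b in bad):
            # Return the replacement text and the position to resume at.
            return ''.join(chr(b) for b in bad), exc.end
    # For undecodable bytes outside the C1 range this handler has no
    # answer -- exactly the ambiguity Rob describes -- so re-raise.
    raise exc

codecs.register_error('c1-control-passthrough', c1_control_passthrough)

# 0x81 is unassigned in cp1252; the handler passes it through as U+0081:
decoded = b'abc\x81'.decode('cp1252', errors='c1-control-passthrough')
# -> 'abc\x81'
```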
>>>> There are a few issues here:
>>>>
>>>> * WHATWG encodings are mostly for decoding content in order to
>>>>   show it in the browser, accepting broken encoding data.
>>>
>>> And sometimes Python apps that pull data from the web.
>>>
>>>> Python already has support for this by using one of the available
>>>> error handlers, or adding new ones to suit the needs.
>>>
>>> This seems cumbersome though.
>>
>> Why is that?
>>
>> Python 3 uses such error handlers for most of the I/O that's done
>> with the OS already, and for very similar reasons: dealing with
>> broken data or broken configurations.
>>
>>>> If we'd add the encodings, people will start creating more
>>>> broken data, since this is what the WHATWG codecs output
>>>> when encoding Unicode.
>>>
>>> That's FUD. Only apps that specifically use the new WHATWG encodings
>>> would be able to consume that data. And surely the practice of web
>>> browsers will have a much bigger effect than Python's choice.
>>
>> It's not FUD. I don't think we ought to encourage having
>> Python create more broken data. The purpose of the WHATWG
>> encodings is to help browsers deal with decoding broken
>> data in a uniform way. It's not to generate more such data.
>>
>> That may be regarded as a purist's view, but it also has a very
>> practical meaning. The output of the codecs will only be readable
>> by browsers implementing the WHATWG encodings. Other tools
>> receiving the data will run into the same decoding problems.
>>
>> Once you have Unicode, it's better to stay there and use
>> UTF-8 for encoding to avoid any such issues.
>>
>>>> As discussed, this could be addressed by making the WHATWG
>>>> codecs decode-only.
>>>
>>> But that would defeat the point of roundtripping, right?
>>
>> Yes, intentionally. Once you have Unicode, the data should
>> be encoded correctly back into UTF-8 or whatever legacy encoding
>> is needed, fixing any issues while in Unicode.
>> As always, it's better to explicitly address such problems than
>> to simply punt on them and write back broken data.
>>
>>>> * The use case seems limited to implementing browsers or headless
>>>>   implementations working like browsers.
>>>>
>>>>   That's not really general enough to warrant adding lots of
>>>>   new codecs to the stdlib. A PyPI package is better suited
>>>>   for this.
>>>
>>> Perhaps, but such a package already exists and its author (who surely
>>> has read a lot of bug reports from its users) says that this is
>>> cumbersome.
>>
>> The only critique I read was that registering the codecs
>> is not explicit enough, but that's really only a nit, since
>> you can easily have the codec package expose a register
>> function which you then call explicitly in the code using
>> the codecs.
>>
>>>> * The WHATWG codecs do not only cover simple mapping codecs,
>>>>   but also many multi-byte ones for e.g. Asian languages.
>>>>
>>>>   I doubt that we'd want to maintain such codecs in the stdlib,
>>>>   since this will increase the download sizes of the installers
>>>>   and also require people knowledgeable about these variants
>>>>   to work on them and fix any issues.
>>>
>>> Really? Why is adding a bunch of codecs so much effort? Surely the
>>> translation tables contain data that compresses well? And surely we
>>> don't need a separate dedicated piece of C code for each new codec?
>>
>> For the simple charmap style codecs that's true. Not so for the
>> Asian ones, and the latter also do require dedicated C code (see
>> Modules/cjkcodecs).
>>
>>>> Overall, I think either pointing people to error handlers
>>>> or perhaps adding a new one specifically for the case of
>>>> dealing with control character mappings would provide a better
>>>> maintenance / usefulness ratio than adding lots of new
>>>> legacy codecs to the stdlib.
>>>
>>> Wouldn't error handlers be much slower?
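[Editor's note: the "expose a register function which you then call explicitly" pattern MAL describes can be sketched as follows. The codec name `web-1252` is hypothetical, and for brevity the sketch just delegates to the stdlib cp1252 codec rather than implementing a real WHATWG mapping; the point is only the opt-in registration shape.]

```python
import codecs

def _search(name):
    # codecs.lookup() hands search functions a normalized name
    # (lowercased; in newer Pythons hyphens also become underscores),
    # so match both spellings to be safe.
    if name in ('web_1252', 'web-1252'):
        info = codecs.lookup('cp1252')  # stand-in for a real WHATWG table
        return codecs.CodecInfo(
            encode=info.encode,
            decode=info.decode,
            incrementalencoder=info.incrementalencoder,
            incrementaldecoder=info.incrementaldecoder,
            streamreader=info.streamreader,
            streamwriter=info.streamwriter,
            name='web-1252',
        )
    return None  # not ours; let other search functions try

def register():
    # The explicit entry point: nothing is registered at import time,
    # users opt in by calling register() themselves.
    codecs.register(_search)

register()
data = 'déjà vu'.encode('web-1252')
```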
>>> And to me it seems a new error handler is a much *bigger* deal than
>>> some new encodings -- error handlers must work for *all* encodings.
>>
>> Error handlers have a standard interface and so they will work
>> for all codecs. Some codecs limit the number of handlers that
>> can be used, but most accept all registered handlers.
>>
>> If a handler is too slow in Python, it can be coded in C for
>> speed.
>>
>>>> BTW: WHATWG pushes for always using UTF-8 as far as I can tell
>>>> from their website.
>>>
>>> As does Python. But apparently it will take decades more to get there.
>>
>> Yes indeed, so let's not add even more confusion by adding more
>> variants of the legacy encodings.
>>
>> --
>> Marc-Andre Lemburg
>> eGenix.com
>>
>> Professional Python Services directly from the Experts (#1, Jan 19 2018)
>> >>> Python Projects, Coaching and Consulting ... http://www.egenix.com/
>> >>> Python Database Interfaces ... http://products.egenix.com/
>> >>> Plone/Zope Database Interfaces ... http://zope.egenix.com/
>> ________________________________________________________________________
>>
>> ::: We implement business ideas - efficiently in both time and costs :::
>>
>> eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
>> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
>> Registered at Amtsgericht Duesseldorf: HRB 46611
>> http://www.egenix.com/company/contact/
>> http://www.malemburg.com/

--
--Guido van Rossum (python.org/~guido)
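[Editor's note: the roundtripping both sides keep returning to can be illustrated with plain stdlib codecs, no WHATWG mappings involved. The example below shows how mojibake arises and why a lossless byte-for-byte roundtrip is what makes repair possible at all.]

```python
# Mojibake: UTF-8 bytes mistakenly decoded with a legacy one-byte codec.
text = 'café'
mojibake = text.encode('utf-8').decode('latin-1')  # 'cafÃ©'

# latin-1 roundtrips all 256 byte values losslessly, so the damage can
# be undone by re-encoding and then decoding correctly:
fixed = mojibake.encode('latin-1').decode('utf-8')
assert fixed == text

# cp1252 lacks this property: five bytes (0x81, 0x8d, 0x8f, 0x90, 0x9d)
# are unassigned, so the same repair can fail for cp1252 mojibake --
# that roundtripping gap is what the proposed WHATWG-style mappings close.
```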
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/