On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <m...@egenix.com> wrote:
> On 19.01.2018 05:38, Nathaniel Smith wrote:
> > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <gu...@python.org> wrote:
> >> Can someone explain to me why this is such a controversial issue?
> >
> > I guess practicality versus purity is always controversial :-)
> >
> >> It seems reasonable to me to add new encodings to the stdlib that do the
> >> roundtripping requested in the first message of the thread. As long as they
> >> have new names that seems to fall under "practicality beats purity".
>
> There are a few issues here:
>
> * WHATWG encodings are mostly for decoding content in order to
>   show it in the browser, accepting broken encoding data.

And sometimes Python apps that pull data from the web.

>   Python already has support for this by using one of the available
>   error handlers, or adding new ones to suit the needs.

This seems cumbersome though.

>   If we'd add the encodings, people will start creating more
>   broken data, since this is what the WHATWG codecs output
>   when encoding Unicode.

That's FUD. Only apps that specifically use the new WHATWG encodings would
be able to consume that data. And surely the practice of web browsers will
have a much bigger effect than Python's choice.

>   As discussed, this could be addressed by making the WHATWG
>   codecs decode-only.

But that would defeat the point of roundtripping, right?

> * The use case seems limited to implementing browsers or headless
>   implementations working like browsers.
>
>   That's not really general enough to warrant adding lots of
>   new codecs to the stdlib. A PyPI package is better suited
>   for this.

Perhaps, but such a package already exists and its author (who surely has
read a lot of bug reports from its users) says that this is cumbersome.

> * The WHATWG codecs do not only cover simple mapping codecs,
>   but also many multi-byte ones for e.g. Asian languages.
>
>   I doubt that we'd want to maintain such codecs in the stdlib,
>   since this will increase the download sizes of the installers
>   and also require people knowledgeable about these variants
>   to work on them and fix any issues.

Really? Why is adding a bunch of codecs so much effort? Surely the
translation tables contain data that compresses well? And surely we don't
need a separate dedicated piece of C code for each new codec?

> Overall, I think either pointing people to error handlers
> or perhaps adding a new one specifically for the case of
> dealing with control character mappings would provide a better
> maintenance / usefulness ratio than adding lots of new
> legacy codecs to the stdlib.

Wouldn't error handlers be much slower? And to me it seems a new error
handler is a much *bigger* deal than some new encodings -- error handlers
must work for *all* encodings.

> BTW: WHATWG pushes for always using UTF-8 as far as I can tell
> from their website.

As does Python. But apparently it will take decades more to get there.

> >> (Modifying existing encodings seems wrong -- did the feature request somehow
> >> transmogrify into that?)
> >
> > Someone did discover that Microsoft's current implementations of the
> > windows-* encodings match the WHAT-WG spec, rather than the Unicode
> > spec that Microsoft originally wrote.
>
> No, MS implements something called "best fit encodings"
> and these are different than what WHATWG uses.
>
> Unlike the WHATWG encodings, these are documented as vendor encodings
> on the Unicode site, which is what we normally use as reference
> for our stdlib codecs.
>
> However, whether these are actually a good idea, is open to discussion
> as well, since they sometimes go a bit far with "best fit", e.g.
> mapping the infinity symbol to 8.
>
> Again, using the error handlers we have for dealing with
> situations which require non-standard encoding behavior is
> the better approach:
>
> https://docs.python.org/3.7/library/codecs.html#error-handlers
>
> Adding new ones is possible as well.
>
> > So there is some argument that
> > Python's existing encodings are simply out of date, and changing
> > them would be a bugfix. (And standards aside, it is surely going to be
> > somewhat error-prone if Python's windows-1252 doesn't match everyone
> > else's implementations of windows-1252.) But yeah, AFAICT the original
> > requesters would be happy either way; they just want it available
> > under some name.
>
> The encodings are not out of date. I don't know where you got
> that impression from.
>
> The Windows API WideCharToMultiByte which was quoted in the discussion:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>
> unfortunately uses the above mentioned best fit encodings,
> but this can and should be switched off by specifying the
> WC_NO_BEST_FIT_CHARS for anything that requires validation
> or needs to be interoperable:
>
> """
> For strings that require validation, such as file, resource, and user
> names, the application should always use the WC_NO_BEST_FIT_CHARS flag.
> This flag prevents the function from mapping characters to characters
> that appear similar but have very different semantics. In some cases,
> the semantic change can be extreme. For example, the symbol for "∞"
> (infinity) maps to 8 (eight) in some code pages.
> """
>
> --
> Marc-Andre Lemburg
> eGenix.com

--
--Guido van Rossum (python.org/~guido)
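
To make the error-handler option discussed above concrete, here is a minimal sketch; it is not code proposed by anyone in the thread. It assumes a hypothetical handler name, "c1_passthrough" (both the name and the function are invented for illustration), registered through the stdlib's codecs.register_error. The idea is that the bytes Python's cp1252 leaves undefined decode to the C1 control characters with the same value and encode back to the same bytes. It has only been thought through for cp1252; since a handler name can be passed to any codec, its behavior under other encodings is exactly the "must work for *all* encodings" concern raised above.

```python
import codecs


def c1_passthrough(exc):
    """Hypothetical handler: pass C1-range values through unchanged."""
    bad = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        # 'bad' is a bytes slice; accept only bytes 0x80-0x9F.
        if all(0x80 <= b <= 0x9F for b in bad):
            # latin-1 maps each byte to the code point of the same value.
            return bad.decode("latin-1"), exc.end
    elif isinstance(exc, UnicodeEncodeError):
        # 'bad' is a str slice; accept only U+0080-U+009F.
        if all(0x80 <= ord(ch) <= 0x9F for ch in bad):
            # Returning bytes tells the encoder to copy them verbatim.
            return bad.encode("latin-1"), exc.end
    # Anything else stays an error, as with errors="strict".
    raise exc


codecs.register_error("c1_passthrough", c1_passthrough)

# 0x81 is undefined in Python's cp1252; 0x93 and 0x94 are curly quotes.
raw = b"\x93quoted\x94 \x81"
text = raw.decode("cp1252", errors="c1_passthrough")
back = text.encode("cp1252", errors="c1_passthrough")
```

With this handler in place, text ends with the control character U+0081 and back equals raw, while the same bytes still raise UnicodeDecodeError under the default strict handling. Whether this is less cumbersome or slower than dedicated codecs is the open question in the thread; the sketch does not settle it.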