On 10.01.2018 20:13, Rob Speer wrote:
> I was originally proposing these encodings under different names, and
> that's what I think they should have. Indeed, that helps because a pip
> installable library can backport the new encodings to previous versions of
> Python.
>
> Having a pip installable library as the _only_ way to use these encodings
> is the status quo that I am very familiar with. It's awkward. To use a
> package that registers new codecs, you have to import something from that
> package, even if you never call anything from what you imported, and that
> makes flake8 complain. The idea that an encoding name may or may not be
> registered, based on what has been imported, breaks our intuition about
> reading Python code and is very hard to statically analyze.

You can have a function in the package which registers the
codecs. That way you do have a call into the library and intuition
is restored :-) (and flake8 should be happy as well):

    import mycodecs
    mycodecs.register()

> I disagree with calling the WHATWG encodings that are implemented in every
> Web browser "non-standard". WHATWG may not have a typical origin story as a
> standards organization, but it _is_ the standards organization for the Web.

I don't really want to get into a discussion here. Browsers
use these modified encodings to cope with mojibake or web content
which isn't quite standards compliant. That's a valid use case, but
promoting such wrong use by having work-around encodings in
the stdlib and having Python produce non-standard output
doesn't strike me as a good way forward. We do have error handlers
for dealing with partially corrupted data. I think that's good
enough.

> I'm really not interested in best-fit mappings that turn infinity into "8"
> and square roots into "v". Making weird mappings like that sounds like a
> job for the "unidecode" library, not the stdlib.

Well, one of your main arguments was that the Windows API follows
these best fit encodings.

I agree that best fit may not necessarily be best fit for
everyone :-)

> On Wed, 10 Jan 2018 at 13:36 Rob Speer <rsp...@luminoso.com> wrote:
>
>> I'm looking at the documentation of "best fit" mappings, and that seems to
>> be a different matter. It appears that best-fit mappings are designed to be
>> many-to-one mappings used only for encoding.
>>
>> "Examples of best fit are converting fullwidth letters to their
>> counterparts when converting to single byte code pages, and mapping the
>> Infinity character to the number 8." (Mapping ∞ to 8? Seriously?!) It also
>> does things such as mapping Cyrillic letters to Latin letters that look
>> like them.
>>
>> This is not what I'm interested in implementing. I just want there to be
>> encodings that match the WHATWG encodings exactly. If they have to be given
>> a different name, that's fine with me.
>>
>> On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg <m...@egenix.com> wrote:
>>
>>> On 10.01.2018 00:56, Rob Speer wrote:
>>>> Oh that's interesting. So it seems to be Python that's the exception here.
>>>>
>>>> Would we really be able to add entries to character mappings that haven't
>>>> changed since Python 2.0?
>>>
>>> The Windows mappings in Python come directly from the Unicode
>>> Consortium mapping files.
>>>
>>> If the Consortium changes the mappings, we can update them.
>>>
>>> If not, then we have a problem, since consumers are not only
>>> the win32 APIs, but also other tools out there running on
>>> completely different platforms, e.g. Java tools or web servers
>>> providing downloads using the Windows code page encodings.
>>>
>>> Allowing such mappings in the existing codecs would then result in
>>> failures when the "other" sides see the decoded Unicode version and
>>> try to encode back into the original encoding - you'd move the
>>> problem from the Python side to the "other" side of the
>>> integration.
>>>
>>> I had a look on the Unicode FTP site and they have since added
>>> a new directory with mapping files they call "best fit":
>>>
>>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
>>>
>>> The WideCharToMultiByte() defaults to best fit, but also offers
>>> a mode where it operates in standards compliant mode:
>>>
>>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>>
>>> See flag WC_NO_BEST_FIT_CHARS.
>>>
>>> Unicode TR#22 is also clear on this:
>>>
>>> https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned
>>>
>>> It allows such best fit mappings to make encodings round-trip
>>> safe, but requires keeping these separate from the original
>>> standard mappings:
>>>
>>> """
>>> It is very important that systems be able to distinguish between the
>>> fallback mappings and regular mappings. Systems like XML require the use
>>> of hex escape sequences (NCRs) to preserve round-trip integrity; use of
>>> fallback characters in that case corrupts the data.
>>> """
>>>
>>> If you read the above section in TR#22 you quickly get reminded
>>> of what the Unicode error handlers do (we basically implement
>>> the three modes it mentions... raise, ignore, replace).
>>>
>>> Now, for unmapped sequences an error handler can opt for
>>> using a fallback sequence instead.
>>>
>>> So in addition to adding best fit codecs, there's also the
>>> option to add an error handler for best fit resolution of
>>> unmapped sequences.
>>>
>>> Given the above, I don't think we ought to change the existing
>>> standards compliant mappings, but use one of two solutions:
>>>
>>> a) add "best fit" encodings (see the Unicode FTP site for
>>>    a list)
>>>
>>> b) add an error handler "bestfit" which implements the
>>>    fallback modes for the encodings in question
>>>
>>>
>>>> On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas <
>>>> python-ideas@python.org> wrote:
>>>>
>>>>> First of all, many thanks for such an excellently written letter. It was a
>>>>> real pleasure to read.
>>>>> On 10.01.2018 0:15, Rob Speer wrote:
>>>>>
>>>>> Hi! I joined this list because I'm interested in filling a gap in Python's
>>>>> standard library, relating to text encodings.
>>>>>
>>>>> There is an encoding with no name of its own. It's supported by every
>>>>> current web browser and standardized by WHATWG. It's so prevalent that if
>>>>> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will
>>>>> get this encoding _instead_. It is probably the second or third most common
>>>>> text encoding in the world. And Python doesn't quite support it.
>>>>>
>>>>> You can see the character table for this encoding at:
>>>>> https://encoding.spec.whatwg.org/index-windows-1252.txt
>>>>>
>>>>> For the sake of discussion, let's call this encoding "web-1252". WHATWG
>>>>> calls it "windows-1252", but notice that it's subtly different from
>>>>> Python's "windows-1252" encoding. Python's windows-1252 has bytes that are
>>>>> undefined:
>>>>>
>>>>> >>> b'\x90'.decode('windows-1252')
>>>>> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0:
>>>>> character maps to <undefined>
>>>>>
>>>>> In web-1252, the bytes that are undefined according to windows-1252 map to
>>>>> the control characters in those positions in iso-8859-1 -- that is, the
>>>>> Unicode codepoints with the same number as the byte. In web-1252, b'\x90'
>>>>> would decode as '\u0090'.
>>>>>
>>>>> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does
>>>>> the same:
>>>>>
>>>>> "According to the information on Microsoft's and the Unicode
>>>>> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused;
>>>>> however, the Windows API MultiByteToWideChar
>>>>> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>
>>>>> maps these to the corresponding C1 control codes
>>>>> <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."
>>>>>
>>>>> And in ISO-8859-1, the same handling is done for unused code points even
>>>>> by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) :
>>>>>
>>>>> "*ISO-8859-1* is the IANA
>>>>> <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority>
>>>>> preferred name for this standard when supplemented with the C0 and C1
>>>>> control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>
>>>>> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"
>>>>>
>>>>> And what would you think -- these "C1 control codes" are also the
>>>>> corresponding Unicode points! (
>>>>> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )
>>>>>
>>>>> Since Windows is pretty much the reference implementation for
>>>>> "windows-xxxx" encodings, it even makes sense to alter the existing
>>>>> encodings rather than add new ones.
>>>>>
>>>>>
>>>>> This may seem like a silly encoding that encourages doing horrible things
>>>>> with text. That's pretty much the case. But there's a reason every Web
>>>>> browser implements it:
>>>>>
>>>>> - It's compatible with windows-1252
>>>>> - Any sequence of bytes can be round-tripped through it without losing
>>>>>   information
>>>>>
>>>>> It's not just this one encoding. WHATWG's encoding standard (
>>>>> https://encoding.spec.whatwg.org/) contains modified versions of
>>>>> windows-1250 through windows-1258 and windows-874.
>>>>>
>>>>> Support for these encodings matters to me, in part, because I maintain a
>>>>> Unicode data-cleaning library, "ftfy". One thing it does is to detect and
>>>>> undo encoding/decoding errors that cause mojibake, as long as they're
>>>>> detectable and reversible. Looking at real-world examples of text that has
>>>>> been damaged by mojibake, it's clear that lots of text is transferred
>>>>> through what I'm calling the "web-1252" encoding, in a way that's
>>>>> incompatible with Python's "windows-1252".
>>>>>
>>>>> In order to be able to work with and fix this kind of text, ftfy registers
>>>>> new codecs -- and I implemented this even before I knew that they were
>>>>> standardized in Web browsers. When ftfy is imported, you can decode text as
>>>>> "sloppy-windows-1252" (the name I chose for this encoding), for example.
>>>>>
>>>>> ftfy can tell people a sequence of steps that they can use in the future
>>>>> to fix text that's like the text they provided. Very often, these steps
>>>>> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which
>>>>> means the steps only work with ftfy imported, even for people who are not
>>>>> using the features of ftfy.
>>>>>
>>>>> Support for these encodings also seems highly relevant to people who use
>>>>> Python for web scraping, as it would be desirable to maximize compatibility
>>>>> with what a Web browser would do.
>>>>>
>>>>> This really seems like it belongs in the standard library instead of being
>>>>> an incidental feature of my library. I know that code in the standard
>>>>> library has "one foot in the grave". I _want_ these legacy encodings to
>>>>> have one foot in the grave. But some of them are extremely common, and
>>>>> Python code should be able to deal with them.
>>>>>
>>>>> Adding these encodings to Python would be straightforward to implement.
>>>>> Does this require a PEP, a pull request, or further discussion?
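
For concreteness, the mapping described in the quoted proposal, combined
with an explicit registration call of the kind suggested at the top of this
message, might be sketched roughly as follows. This is only an illustration
under stated assumptions: the codec name "web-1252-sketch" and all helper
names are made up here, this is not ftfy's implementation, and the table is
derived from Python's own windows-1252 codec rather than from the WHATWG
index file.

```python
import codecs

# Hypothetical codec name, chosen for this sketch only.
CODEC_NAME = "web-1252-sketch"


def _build_decoding_table():
    """Build a 256-character table: windows-1252 plus same-numbered fallbacks."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("windows-1252"))
        except UnicodeDecodeError:
            # Byte is undefined in Python's windows-1252: use the code point
            # with the same number (a C1 control character).
            chars.append(chr(byte))
    return "".join(chars)


_DECODING_TABLE = _build_decoding_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _search(name):
    # Lookup names arrive normalized; depending on the Python version,
    # hyphens may already have been turned into underscores.
    if name.replace("-", "_") == CODEC_NAME.replace("-", "_"):
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


def register():
    """Explicit registration, so importing the module has no side effects."""
    codecs.register(_search)
```

After calling register(), b'\x90'.decode("web-1252-sketch") returns '\x90'
instead of raising, and every byte value round-trips through the codec.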
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Ivan
>>>
>>> --
>>> Marc-Andre Lemburg
>>> eGenix.com

-- 
Marc-Andre Lemburg
eGenix.com

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
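
As a footnote to the error handler route mentioned in this message: a
decoding-side handler can be sketched in a few lines. This is an
illustration under assumptions, not an existing stdlib handler and not the
"bestfit" handler from option b), which would apply the Unicode best fit
tables when encoding. The handler name "c1fallback" and the function names
are hypothetical. The handler maps each undecodable byte to the code point
with the same number and is installed through an explicit register() call.

```python
import codecs


def c1_fallback(exc):
    """Replace undecodable bytes with the code points of the same number."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only covers decoding; other errors are passed through.
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    replacement = "".join(chr(byte) for byte in bad_bytes)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


def register():
    """Install the handler under the hypothetical name 'c1fallback'."""
    codecs.register_error("c1fallback", c1_fallback)


if __name__ == "__main__":
    register()
    print(ascii(b"caf\xe9 \x90".decode("windows-1252", errors="c1fallback")))
```

With the handler registered, the standard windows-1252 mapping itself stays
untouched; only callers that explicitly pass errors="c1fallback" get the
fallback behaviour.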