On 10/01/2018 19:13, Rob Speer wrote:
> I was originally proposing these encodings under different names, and
> that's what I think they should have. Indeed, that helps because a pip
> installable library can backport the new encodings to previous versions
> of Python.
>
> Having a pip installable library as the _only_ way to use these
> encodings is the status quo that I am very familiar with. It's awkward.
> To use a package that registers new codecs, you have to import something
> from that package, even if you never call anything from what you
> imported, and that makes flake8 complain. The idea that an encoding name
> may or may not be registered, based on what has been imported, breaks
> our intuition about reading Python code and is very hard to statically
> analyze.
>
> I disagree with calling the WHATWG encodings that are implemented in
> every Web browser "non-standard". WHATWG may not have a typical origin
> story as a standards organization, but it _is_ the standards
> organization for the Web.

Please note that the WHATWG standard describes Windows-1252 as a "Legacy
Single Byte Encoding", and to me the name suggests that it is expected to
be implemented on Windows platforms and for Windows-specific web pages.
THE encoding, i.e. the standard that all browsers and other web
applications are expected to adhere to, is UTF-8.

I am somewhat confused because, according to
https://encoding.spec.whatwg.org/index-windows-1252.txt, 0x90 (one of the
original examples) appears to be undefined, as the table only runs to 127,
i.e. 0x7F.

> I'm really not interested in best-fit mappings that turn infinity into
> "8" and square roots into "v". Making weird mappings like that sounds
> like a job for the "unidecode" library, not the stdlib.
>
> On Wed, 10 Jan 2018 at 13:36 Rob Speer <rsp...@luminoso.com> wrote:
>
> I'm looking at the documentation of "best fit" mappings, and that
> seems to be a different matter. It appears that best-fit mappings
> are designed to be many-to-one mappings used only for encoding.
>
> "Examples of best fit are converting fullwidth letters to their
> counterparts when converting to single byte code pages, and mapping
> the Infinity character to the number 8." (Mapping ∞ to 8?
> Seriously?!) It also does things such as mapping Cyrillic letters to
> Latin letters that look like them.
>
> This is not what I'm interested in implementing. I just want there
> to be encodings that match the WHATWG encodings exactly. If they
> have to be given a different name, that's fine with me.
>
> On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg <m...@egenix.com> wrote:
>
> On 10.01.2018 00:56, Rob Speer wrote:
> > Oh that's interesting. So it seems to be Python that's the
> > exception here.
> >
> > Would we really be able to add entries to character mappings
> > that haven't changed since Python 2.0?
>
> The Windows mappings in Python come directly from the Unicode
> Consortium mapping files.
>
> If the Consortium changes the mappings, we can update them.
>
> If not, then we have a problem, since consumers are not only
> the win32 APIs, but also other tools out there running on
> completely different platforms, e.g. Java tools or web servers
> providing downloads using the Windows code page encodings.
>
> Allowing such mappings in the existing codecs would then result in
> failures when the "other" sides see the decoded Unicode version and
> try to encode back into the original encoding - you'd move the
> problem from the Python side to the "other" side of the
> integration.
>
> I had a look on the Unicode FTP site and they have since added
> a new directory with mapping files they call "best fit":
>
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
>
> WideCharToMultiByte() defaults to best fit, but also offers
> a standards compliant mode:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>
> See flag WC_NO_BEST_FIT_CHARS.
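
For comparison on the Python side: as far as I can tell, the existing
cp1252 codec already behaves like the no-best-fit mode when encoding. A
minimal sketch (the try_encode helper is only a name made up for this
illustration, not something in the stdlib):

```python
def try_encode(text, encoding="cp1252", errors="strict"):
    """Return the encoded bytes, or the exception raised while encoding."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return exc


# U+221E INFINITY has no cp1252 mapping, so strict mode refuses it
# rather than best-fitting it to "8".
strict_result = try_encode("\u221e")

# The built-in "replace" error handler substitutes "?" for unmapped characters.
replaced = try_encode("\u221e", errors="replace")

# The C1 control U+0090 is also unmapped in Python's cp1252 table.
c1_result = try_encode("\x90")
```

So any fallback behaviour has to be requested explicitly through the
errors argument, which seems in line with keeping fallback mappings
separate from the regular ones.
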
>
> Unicode TR#22 is also clear on this:
>
> https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned
>
> It allows such best fit mappings to make encodings round-trip
> safe, but requires keeping these separate from the original
> standard mappings:
>
> """
> It is very important that systems be able to distinguish between the
> fallback mappings and regular mappings. Systems like XML require the use
> of hex escape sequences (NCRs) to preserve round-trip integrity; use of
> fallback characters in that case corrupts the data.
> """
>
> If you read the above section in TR#22 you quickly get reminded
> of what the Unicode error handlers do (we basically implement
> the three modes it mentions... raise, ignore, replace).
>
> Now, for unmapped sequences an error handler can opt for
> using a fallback sequence instead.
>
> So in addition to adding best fit codecs, there's also the
> option to add an error handler for best fit resolution of
> unmapped sequences.
>
> Given the above, I don't think we ought to change the existing
> standards compliant mappings, but use one of two solutions:
>
> a) add "best fit" encodings (see the Unicode FTP site for
>    a list)
>
> b) add an error handler "bestfit" which implements the
>    fallback modes for the encodings in question
>
> > On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas <
> > python-ideas@python.org> wrote:
> >
> >> First of all, many thanks for such an excellently written letter.
> >> It was a real pleasure to read.
> >> On 10.01.2018 0:15, Rob Speer wrote:
> >>
> >> Hi!
> >> I joined this list because I'm interested in filling a gap in Python's
> >> standard library, relating to text encodings.
> >>
> >> There is an encoding with no name of its own. It's supported by every
> >> current web browser and standardized by WHATWG. It's so prevalent that if
> >> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will
> >> get this encoding _instead_. It is probably the second or third most common
> >> text encoding in the world. And Python doesn't quite support it.
> >>
> >> You can see the character table for this encoding at:
> >> https://encoding.spec.whatwg.org/index-windows-1252.txt
> >>
> >> For the sake of discussion, let's call this encoding "web-1252". WHATWG
> >> calls it "windows-1252", but notice that it's subtly different from
> >> Python's "windows-1252" encoding. Python's windows-1252 has bytes that are
> >> undefined:
> >>
> >> >>> b'\x90'.decode('windows-1252')
> >> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0:
> >> character maps to <undefined>
> >>
> >> In web-1252, the bytes that are undefined according to windows-1252 map to
> >> the control characters in those positions in iso-8859-1 -- that is, the
> >> Unicode codepoints with the same number as the byte. In web-1252, b'\x90'
> >> would decode as '\u0090'.
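
To check my understanding of the behaviour described above, here is a
minimal sketch of it on top of the existing codecs. The function names are
made up for this illustration, and the fallback rule (decode a rejected
byte as latin-1) is my reading of the description, not code taken from
ftfy or from the WHATWG spec:

```python
# The five bytes that Python's cp1252 table leaves undefined.
CP1252_GAPS = "\x81\x8d\x8f\x90\x9d"


def decode_web_1252(data: bytes) -> str:
    """Decode with cp1252, mapping undefined bytes to same-numbered code points."""
    chars = []
    for byte in data:
        single = bytes([byte])
        try:
            chars.append(single.decode("cp1252"))
        except UnicodeDecodeError:
            # latin-1 maps every byte to the code point with the same number.
            chars.append(single.decode("latin-1"))
    return "".join(chars)


def encode_web_1252(text: str) -> bytes:
    """Inverse of decode_web_1252 under the same assumption."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ch not in CP1252_GAPS:
                raise
            out.append(ord(ch))
    return bytes(out)


all_bytes = bytes(range(256))
round_tripped = encode_web_1252(decode_web_1252(all_bytes))
```

With this rule every one of the 256 byte values decodes to something and
encodes back to itself, which is the round-trip property being claimed.
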
> >>
> >> According to https://en.wikipedia.org/wiki/Windows-1252, Windows does
> >> the same:
> >>
> >> "According to the information on Microsoft's and the Unicode
> >> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused;
> >> however, the Windows API MultiByteToWideChar
> >> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>
> >> maps these to the corresponding C1 control codes
> >> <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."
> >> And in ISO-8859-1, the same handling is done for unused code points even
> >> by the standard (https://en.wikipedia.org/wiki/ISO/IEC_8859-1):
> >>
> >> "*ISO-8859-1* is the IANA
> >> <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority>
> >> preferred name for this standard when supplemented with the C0 and C1
> >> control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>
> >> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"
> >> And what would you think -- these "C1 control codes" are also the
> >> corresponding Unicode points!
> >> (https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block))
> >>
> >> Since Windows is pretty much the reference implementation for
> >> "windows-xxxx" encodings, it even makes sense to alter the existing
> >> encodings rather than add new ones.
> >>
> >>
> >> This may seem like a silly encoding that encourages doing horrible things
> >> with text. That's pretty much the case. But there's a reason every Web
> >> browser implements it:
> >>
> >> - It's compatible with windows-1252
> >> - Any sequence of bytes can be round-tripped through it without losing
> >>   information
> >>
> >> It's not just this one encoding. WHATWG's encoding standard
> >> (https://encoding.spec.whatwg.org/) contains modified versions of
> >> windows-1250 through windows-1258 and windows-874.
> >>
> >> Support for these encodings matters to me, in part, because I maintain a
> >> Unicode data-cleaning library, "ftfy". One thing it does is to detect and
> >> undo encoding/decoding errors that cause mojibake, as long as they're
> >> detectible and reversible. Looking at real-world examples of text that has
> >> been damaged by mojibake, it's clear that lots of text is transferred
> >> through what I'm calling the "web-1252" encoding, in a way that's
> >> incompatible with Python's "windows-1252".
> >>
> >> In order to be able to work with and fix this kind of text, ftfy registers
> >> new codecs -- and I implemented this even before I knew that they were
> >> standardized in Web browsers. When ftfy is imported, you can decode text as
> >> "sloppy-windows-1252" (the name I chose for this encoding), for example.
> >>
> >> ftfy can tell people a sequence of steps that they can use in the future
> >> to fix text that's like the text they provided. Very often, these steps
> >> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which
> >> means the steps only work with ftfy imported, even for people who are not
> >> using the features of ftfy.
> >>
> >> Support for these encodings also seems highly relevant to people who use
> >> Python for web scraping, as it would be desirable to maximize compatibility
> >> with what a Web browser would do.
> >>
> >> This really seems like it belongs in the standard library instead of being
> >> an incidental feature of my library. I know that code in the standard
> >> library has "one foot in the grave". I _want_ these legacy encodings to
> >> have one foot in the grave. But some of them are extremely common, and
> >> Python code should be able to deal with them.
> >>
> >> Adding these encodings to Python would be straightforward to implement.
> >> Does this require a PEP, a pull request, or further discussion?
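
For anyone following along who has not seen how a package registers a
codec, this is roughly the mechanism as I understand it. It is only a
sketch using the stdlib codecs module; the codec name "demo-web-1252" and
the helper names are invented here and are not what ftfy actually uses:

```python
import codecs

CODEC_NAME = "demo-web-1252"  # hypothetical name, chosen for this sketch

# 256-entry decoding table: cp1252 where it is defined, otherwise the
# code point with the same number as the byte.
DECODING_TABLE = "".join(
    bytes([b]).decode("cp1252", errors="ignore") or chr(b) for b in range(256)
)
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _search(name):
    # The lookup name arrives normalized, and the treatment of hyphens has
    # differed between Python versions, so compare hyphen-insensitively.
    if name.replace("-", "_") == CODEC_NAME.replace("-", "_"):
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


# The codec only exists for the process once this line has run, which is
# why the registering module has to be imported even if nothing in it is
# called directly.
codecs.register(_search)

decoded = b"\x80\x90".decode(CODEC_NAME)
```

That registration side effect is the part that static analysis tools
cannot see, as described at the top of this message.
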
> >>
> >>
> >> _______________________________________________
> >> Python-ideas mailing list
> >> Python-ideas@python.org
> >> https://mail.python.org/mailman/listinfo/python-ideas
> >> Code of Conduct: http://python.org/psf/codeofconduct/
> >>
> >>
> >> --
> >> Regards,
> >> Ivan
>
> --
> Marc-Andre Lemburg
> eGenix.com
>
> Professional Python Services directly from the Experts (#1, Jan 10 2018)
> >>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
> >>> Python Database Interfaces ...           http://products.egenix.com/
> >>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/
> ________________________________________________________________________
>
> ::: We implement business ideas - efficiently in both time and costs :::
>
> eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
> D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
> Registered at Amtsgericht Duesseldorf: HRB 46611
> http://www.egenix.com/company/contact/
> http://www.malemburg.com/

--
Steve (Gadget) Barnes
Any opinions in this message are my personal opinions and do not reflect
those of my employer.
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/