Re: [Python-ideas] Support WHATWG versions of legacy encodings

Ivan Pozdeev via Python-ideas Tue, 09 Jan 2018 13:52:19 -0800

First of all, many thanks for such an excellently written letter. It was a real pleasure to read.


On 10.01.2018 0:15, Rob Speer wrote:

Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings.
There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it.
You can see the character table for this encoding at:
https://encoding.spec.whatwg.org/index-windows-1252.txt
For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding. Python's windows-1252 has bytes that are undefined:
>>> b'\x90'.decode('windows-1252')
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.
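
The mapping described above can be sketched in a few lines. This is only an illustration, not the WHATWG algorithm and not code from any library: the function names decode_web1252_sketch and undefined_bytes are made up here, and the sketch leans on Python's existing "windows-1252" codec to decide which bytes count as undefined.

```python
def decode_web1252_sketch(data: bytes) -> str:
    """Decode byte by byte: windows-1252 where defined, else the same-numbered code point."""
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("windows-1252"))
        except UnicodeDecodeError:
            # Undefined in Python's windows-1252: fall back to U+00XX,
            # which is what iso-8859-1 would give for this byte.
            chars.append(chr(value))
    return "".join(chars)


def undefined_bytes() -> list:
    """Byte values that Python's windows-1252 codec refuses to decode."""
    result = []
    for value in range(256):
        try:
            bytes([value]).decode("windows-1252")
        except UnicodeDecodeError:
            result.append(value)
    return result


print([hex(v) for v in undefined_bytes()])
print(repr(decode_web1252_sketch(b"\x90")))
```

Bytes that windows-1252 does define, such as b'\x80' for the euro sign, are unaffected by the fallback.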

According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does the same:

"According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx> maps these to the corresponding C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."

And in ISO-8859-1, the same handling is done for unused code points even by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) :

"*ISO-8859-1* is the IANA <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority> preferred name for this standard when supplemented with the C0 and C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"

And what do you know -- these "C1 control codes" are also the corresponding Unicode code points! ( https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )

Since Windows is pretty much the reference implementation for "windows-xxxx" encodings, it even makes sense to alter the existing encodings rather than add new ones.

This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it:
- It's compatible with windows-1252
- Any sequence of bytes can be round-tripped through it without losing information
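
The round-trip property can be checked with a small table-based sketch. It assumes the table is built from Python's own "windows-1252" codec plus the same-numbered code point for undefined bytes; the names DECODE, ENCODE, decode_sketch and encode_sketch are invented for this example.

```python
def _char_for(value: int) -> str:
    """windows-1252 character for a byte, or U+00XX where windows-1252 is undefined."""
    try:
        return bytes([value]).decode("windows-1252")
    except UnicodeDecodeError:
        return chr(value)


# 256-entry decoding table and its inverse
DECODE = [_char_for(value) for value in range(256)]
ENCODE = {char: value for value, char in enumerate(DECODE)}


def decode_sketch(data: bytes) -> str:
    return "".join(DECODE[value] for value in data)


def encode_sketch(text: str) -> bytes:
    return bytes(ENCODE[char] for char in text)


all_bytes = bytes(range(256))

# Every byte maps to a distinct character, so the table is invertible
assert len(ENCODE) == 256
# Decoding and re-encoding returns the original bytes
assert encode_sketch(decode_sketch(all_bytes)) == all_bytes

# Python's strict windows-1252 cannot decode the same input
try:
    all_bytes.decode("windows-1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

The inverse table only exists because no two bytes share a character, which is what makes lossless round-tripping possible.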
It's not just this one encoding. WHATWG's encoding standard (https://encoding.spec.whatwg.org/) contains modified versions of windows-1250 through windows-1258 and windows-874.
Support for these encodings matters to me, in part, because I maintain a Unicode data-cleaning library, "ftfy". One thing it does is to detect and undo encoding/decoding errors that cause mojibake, as long as they're detectable and reversible. Looking at real-world examples of text that has been damaged by mojibake, it's clear that lots of text is transferred through what I'm calling the "web-1252" encoding, in a way that's incompatible with Python's "windows-1252".
In order to be able to work with and fix this kind of text, ftfy registers new codecs -- and I implemented this even before I knew that they were standardized in Web browsers. When ftfy is imported, you can decode text as "sloppy-windows-1252" (the name I chose for this encoding), for example.
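
For readers unfamiliar with codec registration, here is a minimal sketch of how a charmap codec of this kind could be registered through the standard codecs module. It is not ftfy's actual implementation; the codec name "web-1252-sketch" and the helper names are hypothetical and chosen so as not to clash with any real codec.

```python
import codecs


def _char_for(value: int) -> str:
    """windows-1252 character for a byte, or U+00XX where windows-1252 is undefined."""
    try:
        return bytes([value]).decode("windows-1252")
    except UnicodeDecodeError:
        return chr(value)


# A 256-character decoding table, in the same shape the stdlib charmap codecs use
DECODING_TABLE = "".join(_char_for(value) for value in range(256))
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _search(name):
    # Lookup names arrive normalized; depending on the Python version,
    # hyphens may or may not already be converted to underscores.
    if name.replace("-", "_") == "web_1252_sketch":
        return codecs.CodecInfo(name="web-1252-sketch", encode=_encode, decode=_decode)
    return None


codecs.register(_search)

print(repr(b"\x90".decode("web-1252-sketch")))
```

Once the search function is registered, the name works anywhere an encoding name is accepted, for example in bytes.decode or str.encode.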
ftfy can tell people a sequence of steps that they can use in the future to fix text that's like the text they provided. Very often, these steps require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which means the steps only work with ftfy imported, even for people who are not using the features of ftfy.
Support for these encodings also seems highly relevant to people who use Python for web scraping, as it would be desirable to maximize compatibility with what a Web browser would do.
This really seems like it belongs in the standard library instead of being an incidental feature of my library. I know that code in the standard library has "one foot in the grave". I _want_ these legacy encodings to have one foot in the grave. But some of them are extremely common, and Python code should be able to deal with them.
Adding these encodings to Python would be straightforward to implement. Does this require a PEP, a pull request, or further discussion?
_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


--
Regards,
Ivan

