On Jan 11, 2018 4:05 AM, "Antoine Pitrou" <solip...@pitrou.net> wrote:
> Define "widely used". If web-XXX is a superset of windows-XXX, then
> perhaps web-XXX is "used" in the sense of "used to decode valid
> windows-XXX data" (but windows-XXX could be used just as well to
> decode the same data). The question is rather: how often does
> web-XXX mojibake happen? We're well in the 2010s now and you'd hope
> that mojibake doesn't happen as often as it used to in, e.g., 1998.

I'm not an expert here or anything, but from what we've been hearing it sounds like it must be used by all standards-compliant HTML parsers. I don't *like* the standard much, but I don't think that the stdlib should refuse to handle standards-compliant HTML, or refuse to help users handle it correctly, just because the HTML standard has unfortunate things in it. We're not going to convince them to change the standard or anything. And this whole thread started with someone saying that their mojibake-fixing library is having trouble because of this, so clearly mojibake does still exist.

Does it help if we reframe it as not that whatwg is "wrong" about windows-1252, but rather that there is this encoding web-1252, and thanks to an interesting quirk of history, in HTML documents the byte sequence b'<meta charset="windows-1252">' indicates a file using this encoding?

In fact the mapping between byte sequences and character sets here is so arbitrary that in standards-compliant HTML, the byte sequences b'<meta charset="ascii">', b'<meta charset="iso-8859-1">', and b'<meta charset="latin1">' *also* indicate that the file is encoded using web-1252. (See: https://encoding.spec.whatwg.org/#names-and-labels)

-n
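To make the difference concrete, here is a rough sketch of what "web-1252" decoding and the label mapping could look like in pure Python. The names decode_web_1252 and html_label_to_encoding are made up for illustration and are not stdlib APIs. The sketch assumes the only difference from Python's cp1252 codec is the five bytes that codec leaves undefined, which the WHATWG index maps to the C1 control characters with the same code points, and the label table lists only the labels mentioned in this message, not the full set from the spec.

```python
# Bytes that Python's cp1252 codec treats as undefined. The WHATWG
# windows-1252 index maps each of them to the C1 control character
# with the same code point (0x81 -> U+0081, and so on).
UNDEFINED_IN_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_web_1252(data: bytes) -> str:
    """Decode bytes the way an HTML parser would for windows-1252.

    Hypothetical helper: every byte decodes, none raises.
    """
    chars = []
    for byte in data:
        if byte in UNDEFINED_IN_CP1252:
            # Fall back to the identity mapping (same as latin-1).
            chars.append(chr(byte))
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)


# Partial label table: only the labels quoted in this thread.
# The WHATWG spec lists more labels for the same encoding.
WEB_1252_LABELS = frozenset({"windows-1252", "ascii", "iso-8859-1", "latin1"})


def html_label_to_encoding(label: str) -> str:
    """Map an HTML charset label to an encoding name (hypothetical helper)."""
    # Labels are compared after trimming ASCII whitespace and lowercasing.
    normalized = label.strip("\t\n\f\r ").lower()
    if normalized in WEB_1252_LABELS:
        return "web-1252"
    raise LookupError(f"label not covered by this sketch: {label!r}")
```

With this, decode_web_1252(b"\x81") returns "\x81", whereas b"\x81".decode("cp1252") raises UnicodeDecodeError. Likewise html_label_to_encoding("latin1") and html_label_to_encoding("ASCII") both return "web-1252", even though Python's own codecs treat those labels as distinct encodings.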