On 3/23/18 6:35 AM, Chris Angelico wrote:
On Fri, Mar 23, 2018 at 9:29 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
On Fri, 23 Mar 2018 18:35:20 +1100, Chris Angelico wrote:

That doesn't seem to be a strictly-correct Latin-1 decoder, then. There
are a number of unassigned byte values in ISO-8859-1.
That's incorrect, but I don't blame you for getting it wrong. Who thought
that it was a good idea to distinguish between "ISO 8859-1" and
"ISO-8859-1" as two related but distinct encodings?

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

The old ISO 8859-1 standard, the one with undefined values, is mostly of
historical interest. For the last twenty years or so, anyone talking
about either Latin-1 or ISO-8859-1 (with or without dashes) almost
certainly means the 1992 IANA superset version which defines all 256
characters:

     "In 1992, the IANA registered the character map ISO_8859-1:1987,
     more commonly known by its preferred MIME name of ISO-8859-1
     (note the extra hyphen over ISO 8859-1), a superset of ISO
     8859-1, for use on the Internet. This map assigns the C0 and C1
     control characters to the unassigned code values thus provides
     for 256 characters via every possible 8-bit value."


Either that, or they actually mean Windows-1252, but let's not go there.
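
For anyone who wants to check this from Python, here is a minimal sketch. It assumes CPython's built-in "latin-1" and "cp1252" codecs; the helper name undecodable is made up for illustration. It tries to decode each of the 256 byte values on its own and collects the ones the codec rejects.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Python's "latin-1" codec maps byte N straight to code point N,
# so decoding cannot fail and the round trip is lossless.
text = all_bytes.decode("latin-1")
round_trip_ok = text.encode("latin-1") == all_bytes


def undecodable(codec):
    """Return the byte values that the named codec refuses to decode.

    Hypothetical helper, not part of any library.
    """
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


latin1_gaps = undecodable("latin-1")
cp1252_gaps = undecodable("cp1252")

print("latin-1 gaps:", [hex(b) for b in latin1_gaps])
print("cp1252 gaps: ", [hex(b) for b in cp1252_gaps])
```

The latin-1 list comes back empty, while Python's cp1252 codec leaves a handful of byte values undefined, which is one practical way to tell the two apart.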

Wait, whaaa.......

Though in my own defense, MySQL itself seems to have a bit of a
problem with encoding names. Its "utf8" is actually "UTF-8 with a
maximum of three bytes per character", in contrast to "utf8mb4" which
is, well, UTF-8.
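
A rough sketch of what the three-byte limit means in practice, using nothing but Python's own UTF-8 encoder. The function names are hypothetical and this does not talk to MySQL at all; it only counts how many bytes each character needs.

```python
def max_utf8_bytes(text):
    """Largest number of UTF-8 bytes needed by any single character."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_three_byte_utf8(text):
    """True if every character encodes to at most three UTF-8 bytes.

    Hypothetical check for illustration only.
    """
    return max_utf8_bytes(text) <= 3


bmp_sample = "caf\u00e9 \u20ac"   # e-acute (2 bytes) and the euro sign (3 bytes)
astral_sample = "\U0001F600"      # an emoji outside the BMP (4 bytes)

print(max_utf8_bytes(bmp_sample), fits_three_byte_utf8(bmp_sample))
print(max_utf8_bytes(astral_sample), fits_three_byte_utf8(astral_sample))
```

Anything outside the Basic Multilingual Plane needs four bytes in UTF-8, so it is exactly those characters that a three-byte-maximum encoding cannot hold.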

In any case, abusing "Latin-1" to store binary data is still wrong.
That's what BLOB is for.

ChrisA

One comment on this whole argument: the original poster asked how to get data from a database that WAS using Latin-1 encoding into JSON (which wants UTF-8 encoding). They asked whether anything needed to be done beyond using .decode('Latin-1'), and in particular whether they needed to use .encode('UTF-8'). The answer should be a simple Yes or No.
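
For reference, the question looks roughly like this in code. This is a sketch only, assuming Python 3 and the standard json module; the byte string and the "name" key are invented stand-ins for whatever the database actually returns.

```python
import json

# Hypothetical value as it might come back from a Latin-1 column.
raw = b"caf\xe9"

# Step 1: bytes -> str. After this the text is Unicode and the
# original encoding no longer matters.
text = raw.decode("latin-1")

# Step 2: json.dumps() takes str and returns str. With the default
# ensure_ascii=True, non-ASCII characters are written as \uXXXX escapes.
json_text = json.dumps({"name": text})

# Step 3: only needed if bytes are wanted (a file opened in binary mode,
# a socket, an HTTP body). The result is valid UTF-8 either way.
json_bytes = json_text.encode("utf-8")

# Variant that keeps the literal character instead of an escape.
json_utf8 = json.dumps({"name": text}, ensure_ascii=False).encode("utf-8")

print(json_text)
print(json_utf8)
```

Under those assumptions, whether an explicit .encode('UTF-8') is needed depends only on whether the consumer wants a str or bytes; the decode step is the same in both cases.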

Instead, someone took the opportunity to advocate that a wholesale change to the database was the only reasonable course of action.

First comment: when someone proposes a change, the burden of proof that the change is warranted generally falls on them. This is especially true when they are telling someone else to make that change.

Absolute statements are very hard to prove (though the difficulty of proof doesn't remove the need to provide it), and are in fact fairly easy to disprove: a single counterexample is enough. Counterexamples to the absolute statement have been provided.

When dealing with a code base, backwards compatibility IS important, and casually changing something that fundamental shouldn't be the first thing anyone thinks about. We weren't given any details about the overall system this was part of, and there could easily be other code using the database that such a change would break. One easy Python example is the change from Python 2 to Python 3: how many years has that gone on, and how many more will people continue to deal with it? That change was over a similar issue: the view that, at least for today, Unicode is the best solution for storing arbitrary text, and the decision to force that change down to the fundamental level.

--
Richard Damon

--
https://mail.python.org/mailman/listinfo/python-list
