Re: Putting Unicode characters in JSON

Richard Damon Fri, 23 Mar 2018 07:16:10 -0700

On 3/23/18 6:35 AM, Chris Angelico wrote:

On Fri, Mar 23, 2018 at 9:29 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:

On Fri, 23 Mar 2018 18:35:20 +1100, Chris Angelico wrote:

That doesn't seem to be a strictly-correct Latin-1 decoder, then. There
are a number of unassigned byte values in ISO-8859-1.

That's incorrect, but I don't blame you for getting it wrong. Who thought
that it was a good idea to distinguish between "ISO 8859-1" and
"ISO-8859-1" as two related but distinct encodings?

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

The old ISO 8859-1 standard, the one with undefined values, is mostly of
historical interest. For the last twenty years or so, anyone talking
about either Latin-1 or ISO-8859-1 (with or without dashes) almost
certainly means the 1992 IANA superset version, which defines all 256 characters:

     "In 1992, the IANA registered the character map ISO_8859-1:1987,
     more commonly known by its preferred MIME name of ISO-8859-1
     (note the extra hyphen over ISO 8859-1), a superset of ISO
     8859-1, for use on the Internet. This map assigns the C0 and C1
     control characters to the unassigned code values thus provides
     for 256 characters via every possible 8-bit value."


Either that, or they actually mean Windows-1252, but let's not go there.
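
A quick way to check the claim above, sketched under the assumption that CPython's built-in "latin-1" and "cp1252" codecs are representative of the two encodings being discussed:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Python's "latin-1" codec maps byte N straight to code point U+00NN,
# so decoding cannot fail for any byte value.
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# The round trip back to bytes is lossless.
assert text.encode("latin-1") == all_bytes

# Windows-1252 reuses 0x80-0x9F for printable characters and leaves
# a few of those bytes undefined, so a strict decode can raise.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print("bytes undefined in cp1252:", [hex(v) for v in undefined])
```

So the codec Python calls "latin-1" behaves like the all-256-values IANA version, not the older standard with holes in it.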

Wait, whaaa.......

Though in my own defense, MySQL itself seems to have a bit of a
problem with encoding names. Its "utf8" is actually "UTF-8 with a
maximum of three bytes per character", in contrast to "utf8mb4" which
is, well, UTF-8.
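
To see which characters would fall outside a three-byte limit, here is a rough sketch in Python. The helper name fits_in_three_bytes is made up for illustration; it is not a MySQL API and only approximates the restriction described above:

```python
def fits_in_three_bytes(text: str) -> bool:
    """Return True if every character encodes to at most three UTF-8 bytes.

    Hypothetical helper: it approximates the three-byte limit described
    above, and is not something MySQL itself provides.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# e-acute needs 2 bytes, the euro sign 3, and an emoji outside the
# Basic Multilingual Plane needs 4.
samples = ["caf\u00e9", "\u20ac", "\U0001F600"]
for sample in samples:
    sizes = [len(ch.encode("utf-8")) for ch in sample]
    print(ascii(sample), sizes, fits_in_three_bytes(sample))
```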

In any case, abusing "Latin-1" to store binary data is still wrong.
That's what BLOB is for.

ChrisA

One comment on this whole argument: the original poster asked how to get data from a database that WAS using Latin-1 encoding into JSON (which wants UTF-8 encoding), and was asking if something needed to be done beyond using .decode('Latin-1'), and in particular if they need to use a .encode('UTF-8'). The answer should be a simple Yes or No.
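
For reference, a minimal sketch of what the question amounts to, assuming Python 3 and the standard json module. The sample bytes are made up; we were never shown the actual data:

```python
import json

# Hypothetical value as it might come back from a Latin-1 database column.
raw = b"caf\xe9"

# Step 1: bytes -> str. This is the .decode('Latin-1') from the question.
text = raw.decode("latin-1")

# Step 2: json.dumps returns a str, not bytes. With the default
# ensure_ascii=True, non-ASCII characters are written as \uXXXX escapes,
# so the result is pure ASCII (and therefore also valid UTF-8).
as_json = json.dumps({"name": text})
print(as_json)

# With ensure_ascii=False the characters are kept as-is. An explicit
# .encode('utf-8') is then only needed where bytes are required, such as
# writing to a socket or to a file opened in binary mode.
as_json_utf8 = json.dumps({"name": text}, ensure_ascii=False).encode("utf-8")
print(as_json_utf8)
```

Both forms parse back to the same data, so whether the extra .encode('UTF-8') is needed depends on whether the consumer wants a str or bytes.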

Instead, someone took the opportunity to advocate that a wholesale change to the database was the only reasonable course of action.

First comment: when someone is proposing a change, the burden of proof that the change is warranted is generally on them. This is especially true when they are telling someone else they need to make such a change.

Absolute statements are very hard to prove (but the difficulty of proof doesn't relieve the need to provide it), and in fact are fairly easy to disprove (one counterexample disproves an absolute statement). Counterexamples to the absolute statement have been provided.

When dealing with a code base, backwards compatibility IS important, and casually changing something that fundamental isn't the first thing that someone should be thinking about. We weren't given any details about the overall system this was part of, and there easily could be other code using the database that such a change would break. One easy Python example is the change from Python 2 to Python 3: how many years has that gone on, and how many more will people continue to deal with it? That transition was over a similar issue: deciding that, at least for today, Unicode is the best solution for storing arbitrary text, and forcing that change down to the fundamental level.


--
Richard Damon

--
https://mail.python.org/mailman/listinfo/python-list
