Bug#1026231: debian-policy: document droppage of support for legacy locales

Anthony Fok Fri, 20 Jan 2023 08:42:12 -0800

On Thu, Jan 19, 2023 at 4:47 AM Simon McVittie <s...@debian.org> wrote:
>
> On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> > In their mind, GB 18030 encompasses a lot more than just
> > a character encoding mapping table.  It is the full support package
> > (including fonts, display, printing, input methods, etc.) for Han
> > Chinese and all other minority languages used in China.
>
> If I'm reading correctly, the character encoding part of GB 18030-2022 is
> a subset of a sufficiently new version of Unicode, in the same way that
> (say) ISO-8859-15 is a subset of Unicode: for every character representable
> in GB 18030-2022, you can point at an equivalent Unicode character and say
> "this is the GB 18030-2022 encoding of U+4E00" or similar? Is that true?


If using ISO-8859-15 "legacy encoding" as comparison, in China that
would be the 1980 "GB2312" (GB 2312-80) standard and the 1993 "GBK"
extension.  The character repertoires that these legacy
encodings/charsets contain are far fewer than what Unicode or ISO/IEC
10646 encompasses, and in that sense, they are "subsets of Unicode".

GB18030, on the other hand, is actually a full UTF (Unicode
Transformation Format), i.e. an encoding of *all* Unicode code points:
GB18030 maps to every Unicode code point while maintaining backward
compatibility with existing GB2312 and GBK documents, just as UTF-8
maps to every Unicode code point while maintaining backward
compatibility with ASCII.
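
A quick way to see this compatibility is the following minimal sketch.
It assumes Python 3 and its built-in gb2312, gbk and gb18030 codecs; the
sample strings and variable names are arbitrary choices for illustration.

```python
# Sketch: GB18030 extends GB2312/GBK the way UTF-8 extends ASCII.
text = "中文"  # two characters present in GB2312, GBK and GB18030

gb2312_bytes = text.encode("gb2312")
gbk_bytes = text.encode("gbk")
gb18030_bytes = text.encode("gb18030")

# The same 2-byte sequences are used by all three encodings.
same_legacy_bytes = gb2312_bytes == gbk_bytes == gb18030_bytes

# ASCII text is byte-identical in UTF-8 and GB18030.
ascii_compatible = (
    "Debian".encode("gb18030") == "Debian".encode("utf-8") == b"Debian"
)

# A character outside the BMP is out of reach for GBK ...
emoji = "\U0001F600"
try:
    emoji.encode("gbk")
    gbk_can_encode_emoji = True
except UnicodeEncodeError:
    gbk_can_encode_emoji = False

# ... but GB18030 encodes it as a 4-byte sequence and round-trips it.
emoji_bytes = emoji.encode("gb18030")
round_trips = emoji_bytes.decode("gb18030") == emoji
```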

GB18030 encodes characters into 1-byte, 2-byte or 4-byte sequences.
The 1-byte sequences are essentially ASCII and the 2-byte sequences
are essentially GBK; the 4-byte sequences give a total of 1,587,600
(126×10×126×10) code points, which easily cover Unicode's 1,112,064
(17×65536 − 2048 surrogates) assigned, reserved, and noncharacter code
points. (source: Wikipedia)
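
The arithmetic above can be checked with a short sketch.  It assumes
Python 3; `supplementary_to_gb18030` is a hypothetical helper name, and
it handles only the supplementary planes (U+10000 to U+10FFFF), assuming
they are mapped linearly onto 4-byte sequences starting at
0x90 0x30 0x81 0x30.  BMP characters need the mapping table and are not
covered here.

```python
# Size of the 4-byte space: bytes 1 and 3 take 126 values (0x81..0xFE),
# bytes 2 and 4 take 10 values (0x30..0x39).
four_byte_space = 126 * 10 * 126 * 10   # 1,587,600
unicode_scalars = 17 * 65536 - 2048     # 1,112,064


def supplementary_to_gb18030(code_point: int) -> bytes:
    """Hypothetical helper: encode U+10000..U+10FFFF as 4 GB18030 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled")
    index = code_point - 0x10000
    # Peel off the mixed-radix digits from the last byte to the first.
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])
```

Comparing the helper's output with Python's own gb18030 codec for a few
code points is an easy sanity check on the assumed mapping.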

Since GB18030 can be used to represent the entirety of all Unicode
code points, I would not call GB18030 a "subset" of Unicode.

And some people like to think of GB18030 as "UTF-GBK", e.g.
http://archives.miloush.net/michkap/archive/2013/03/28/10405914.html

> If that's the case, then supporting text files written in GB 18030
> does not *necessarily* require the internal representation or the
> system locale to be GB 18030, the same way I can still work with legacy
> en_GB.ISO-8859-15 files on my en_GB.UTF-8 system: it could equally well
> be done by using iconv() or equivalent to transcode to UTF-8, UTF-16 or
> UCS-4 on input, doing all text editing operations on that Unicode, and
> then transcoding back into GB 18030 on output. Most language frameworks
> already do this as a matter of API: Qt, Java and Windows tend to work
> with UTF-16 internally, while GLib/GTK uses UTF-8 internally.

Very true.  While GB18030 is another encoding form for Unicode (and
not a subset), we indeed don't need to use GB18030 as the "internal
representation or the system locale"; you have put it very nicely.
GB18030 is also somewhat inefficient as a UTF, as the required mapping
table and 4-byte conversion algorithm take up far more space and are
quite a bit slower than something as elegant as UTF-8.
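
The transcode-on-input, transcode-on-output pattern might look like the
following minimal sketch.  It assumes Python 3, whose `str` type serves
as the internal Unicode representation; `edit_gb18030` and the sample
text are hypothetical and only for illustration.

```python
def edit_gb18030(data: bytes) -> bytes:
    """Hypothetical helper: decode, edit as Unicode, re-encode."""
    text = data.decode("gb18030")      # transcode to Unicode on input
    text = text.replace("旧", "新")    # all editing happens on Unicode text
    return text.encode("gb18030")      # transcode back on output


original = "旧版本".encode("gb18030")
edited = edit_gb18030(original)
```

Neither the system locale nor the internal representation has to be
GB18030 for this to work; only the two boundary conversions do.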

> iconv() seems very unlikely to drop support for GB 18030, ISO-8859-15 and
> other non-Unicode encodings altogether. What this bug report is about is
> dropping support for locales whose associated encoding is non-Unicode,
> such as en_GB.ISO-8859-15 and zh_CN.GB18030, so that the data stream
> between a CLI program and the terminal emulator will be assumed to be UTF-8
> instead of ISO-8859-15 or GB18030.

Indeed, and thankfully, Google Chrome, Mozilla Firefox and LibreOffice
supposedly still support reading (and writing) GB18030 documents
through iconv(), ICU or Qt's encoding conversions.

> The main thing I can see that would be a problem for GB 18030 users
> if the zh_CN.GB18030 locale was dropped is that various programs might
> assume that the locale encoding is the right one to assume when loading
> existing files and unable to guess the encoding, or the right one to
> write into new files by default - and so users who have moved from
> zh_CN.GB18030 to zh_CN.UTF-8 might find themselves unintentionally
> producing new UTF-8 files.

Yes.  These are some of the pains as we transition from legacy
GB2312/GBK encodings towards Unicode, and GB18030 (being a UTF) is
designed as a stepping stone.  But yes, moving to UTF-8 is indeed a
good thing, even in China, as China is not an isolated island.
Chinese people do value interoperability with the world too.
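
A program can sidestep the locale-default pitfall by naming the encoding
explicitly.  The following is a minimal sketch assuming Python 3, where
`open()` without an `encoding` argument falls back to a locale-dependent
default; the file name is hypothetical.

```python
import locale
import os
import tempfile

# Encoding that open() would pick by default; it depends on the locale
# (and on Python's UTF-8 mode), e.g. UTF-8 under zh_CN.UTF-8.
default_encoding = locale.getpreferredencoding(False)

path = os.path.join(tempfile.mkdtemp(), "note.txt")  # hypothetical file

# Be explicit so the file stays GB18030 whatever the locale says.
with open(path, "w", encoding="gb18030") as f:
    f.write("中文")

with open(path, "rb") as f:
    raw = f.read()

# Decoding the same bytes as UTF-8 fails, which is what a program that
# assumes the locale encoding would run into under zh_CN.UTF-8.
try:
    raw.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False
```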

> Preferring to use Unicode does seem to be the direction that all of
> computing is going in, as a simplifying assumption - for example W3C
> advice for HTML is "You should always use the UTF-8 character encoding"[1]
> - and as we know, things that aren't tested usually don't work. So I
> think the level of functionality for non-UTF-8 locales and encodings in
> the software we package is going to decline over time, whether Debian
> wants it to or not.

Very true, and it is already happening, even in China, thankfully.
(See my previous email from today for my 180° turnaround: I finally
realized that the GB18030 authorities are pragmatic and do not
actually require zh_CN.GB18030 to be the system locale, but rather
that GB18030 data can be processed, that characters that were in the
PUA but are now in Unicode can be properly supported, etc.)

>     smcv
>
> [1] https://www.w3.org/International/questions/qa-html-encoding-declarations

Thank you for the discussion!  :-)

Cheers,

Anthony
