On Thu, Jan 19, 2023 at 4:47 AM Simon McVittie <s...@debian.org> wrote:
>
> On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> > In their mind, GB 18030 encompasses a lot more than just
> > a character encoding mapping table. It is the full support package
> > (including fonts, display, printing, input methods, etc.) for Han
> > Chinese and all other minority languages used in China.
>
> If I'm reading correctly, the character encoding part of GB 18030-2022 is
> a subset of a sufficiently new version of Unicode, in the same way that
> (say) ISO-8859-15 is a subset of Unicode: for every character representable
> in GB 18030-2022, you can point at an equivalent Unicode character and say
> "this is the GB 18030-2022 encoding of U+4E00" or similar? Is that true?
If we take ISO-8859-15 as the example of a "legacy encoding", the
counterparts in China would be the 1980 "GB2312" (GB 2312-80) standard
and its 1993 "GBK" extension. The character repertoires of these legacy
encodings/charsets are far smaller than what Unicode or ISO/IEC 10646
encompasses, and in that sense they are "subsets of Unicode".

GB18030, on the other hand, is actually a full UTF or Unicode
Transformation Format (i.e. an encoding of *all* Unicode code points).
GB18030 maps to every Unicode code point while maintaining backward
compatibility with existing GB2312 and GBK documents, just as UTF-8 maps
to every Unicode code point while maintaining backward compatibility
with ASCII.

GB18030 encodes characters as 1-byte, 2-byte or 4-byte sequences. The
1-byte range is essentially ASCII and the 2-byte range is essentially
GBK; the 4-byte sequences give a total of 1,587,600 (126×10×126×10)
code points, which comfortably cover Unicode's 1,112,064
(17×65536 − 2048 surrogates) assigned, reserved, and noncharacter code
points. (source: Wikipedia)

Since GB18030 can represent every Unicode code point, I would not call
GB18030 a "subset" of Unicode. Some people like to think of GB18030 as
"UTF-GBK", e.g.
http://archives.miloush.net/michkap/archive/2013/03/28/10405914.html

> If that's the case, then supporting text files written in GB 18030
> does not *necessarily* require the internal representation or the
> system locale to be GB 18030, the same way I can still work with legacy
> en_GB.ISO-8859-15 files on my en_GB.UTF-8 system: it could equally well
> be done by using iconv() or equivalent to transcode to UTF-8, UTF-16 or
> UCS-4 on input, doing all text editing operations on that Unicode, and
> then transcoding back into GB 18030 on output. Most language frameworks
> already do this as a matter of API: Qt, Java and Windows tend to work
> with UTF-16 internally, while GLib/GTK uses UTF-8 internally.

Very true.
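To make the numbers and the "transcode on input, edit as Unicode, transcode
back on output" idea concrete, here is a minimal Python sketch. It assumes
Python's built-in "gb18030" codec as the converter (standing in for iconv()
or ICU), and the helper name edit_gb18030 is purely hypothetical:

```python
# Code-space arithmetic from the paragraph above.
four_byte_slots = 126 * 10 * 126 * 10        # 4-byte GB18030 sequences
unicode_scalar_values = 17 * 65536 - 2048    # all planes minus surrogates

# One sample each for the 1-byte (ASCII), 2-byte (GBK range) and
# 4-byte (supplementary plane) forms.
samples = ["A", "\u4e00", "\U0001F600"]
byte_lengths = {ch: len(ch.encode("gb18030")) for ch in samples}


def edit_gb18030(data: bytes) -> bytes:
    """Hypothetical helper: decode GB18030, edit as Unicode, re-encode."""
    text = data.decode("gb18030")    # transcode to Unicode on input
    text = text.upper()              # placeholder for any text operation
    return text.encode("gb18030")    # transcode back on output


if __name__ == "__main__":
    print(four_byte_slots, unicode_scalar_values)
    print(byte_lengths)
    print(edit_gb18030("abc\u4e00".encode("gb18030")))
```

Nothing here depends on the system locale being zh_CN.GB18030; the
encoding is named explicitly at the input and output boundaries only.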
While GB18030 is another encoding form for Unicode (and not a subset),
we indeed don't need to use GB18030 as the "internal representation or
the system locale"; you have put it very nicely. GB18030 is also
somewhat inefficient as a UTF: the required mapping table and 4-byte
conversion algorithm take up far more space and are quite a bit slower
than something as elegant as UTF-8.

> iconv() seems very unlikely to drop support for GB 18030, ISO-8859-15 and
> other non-Unicode encodings altogether. What this bug report is about is
> dropping support for locales whose associated encoding is non-Unicode,
> such as en_GB.ISO-8859-15 and zh_CN.GB18030, so that the data stream
> between a CLI program and the terminal emulator will be assumed to be UTF-8
> instead of ISO-8859-15 or GB18030.

Indeed, and thankfully, Google Chrome, Mozilla Firefox and LibreOffice
supposedly still support reading (and writing) GB18030 documents
through iconv(), ICU or Qt's encoding conversions.

> The main thing I can see that would be a problem for GB 18030 users
> if the zh_CN.GB18030 locale was dropped is that various programs might
> assume that the locale encoding is the right one to assume when loading
> existing files and unable to guess the encoding, or the right one to
> write into new files by default - and so users who have moved from
> zh_CN.GB18030 to zh_CN.UTF-8 might find themselves unintentionally
> producing new UTF-8 files.

Yes. These are some of the pains of the transition from the legacy
GB2312/GBK encodings towards Unicode, and GB18030 (being a UTF) is
designed as a stepping stone. But yes, moving to UTF-8 is indeed a good
thing, even in China, as China is not an isolated island. People in
China value interoperability with the rest of the world too.
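The locale-default pitfall described above can be sidestepped when a
program names the file encoding explicitly instead of inheriting it from
the locale. A small Python sketch of that idea follows; the helper names
save_text and load_text and the file name are made up for illustration:

```python
import locale
import os
import tempfile


def save_text(path, text, encoding="gb18030"):
    """Write text with an explicit encoding, ignoring the locale default."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def load_text(path, encoding="gb18030"):
    """Read text with an explicit encoding, ignoring the locale default."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


# What a program would typically fall back to if no encoding were given.
locale_default = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.txt")
    original = "GB18030 \u4e2d\u6587 test"
    save_text(path, original)
    with open(path, "rb") as f:
        raw = f.read()               # bytes actually stored on disk
    restored = load_text(path)
```

With this approach the file stays GB18030 on disk whether the session
runs under zh_CN.UTF-8 or anything else; only programs that silently
fall back to the locale default would start producing UTF-8 files.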
> Preferring to use Unicode does seem to be the direction that all of
> computing is going in, as a simplifying assumption - for example W3C
> advice for HTML is "You should always use the UTF-8 character encoding"[1]
> - and as we know, things that aren't tested usually don't work. So I
> think the level of functionality for non-UTF-8 locales and encodings in
> the software we package is going to decline over time, whether Debian
> wants it to or not.

Very true, and it is already happening, even in China, thankfully.
(See my previous email from today for my 180° turnaround: I finally
realized that the GB18030 authorities are pragmatic and do not actually
require zh_CN.GB18030 to be the system locale, but rather that GB18030
data can be processed, that characters which were in the PUA but are
now in Unicode can be properly supported, etc.)

> smcv
>
> [1] https://www.w3.org/International/questions/qa-html-encoding-declarations

Thank you for the discussion! :-)

Cheers,
Anthony