Re: GB18030-2022 Support in PostgreSQL

John Naylor Sun, 10 Aug 2025 22:51:28 -0700

On Mon, Aug 11, 2025 at 9:01 AM Chao Li <[email protected]> wrote:
>
> I have created a patch https://commitfest.postgresql.org/patch/5954/. 
> CommitFests requested a rebase, so I rebased the code and created the v2 
> patch.
>
> BTW, I have tested all 66 new characters, 9 not-required characters and 18 
> changed characters in a way as:


"9 characters are no longer required by the new standard, but are
retained in this patch for compatibility"

How is that done?

> I added a test case with a mapping changed char, and the test passes:
>
> % make check
> ...
> # All 229 tests passed.
>
> For more details on the standard change, see 
> https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
>
> I am attaching the patch file.

Going from the old .xml file to the .ucm file makes it difficult to
see the relevant changes. Also, there are nearly 1000 non-user-visible
changes like this in the output file that are not explained:

-  /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
+  /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/
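
One way to confirm that such hunks really are comment-only would be to
mask the offsets in both versions of the generated file before diffing.
The following is an untested-in-tree sketch; the function names and the
"<n>" placeholder are made up for illustration, and the regex assumes
the comment format shown above:

```python
import difflib
import re

# Matches generated comments such as:
#   /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
# The pattern is an assumption based on the comment format quoted above.
OFFSET = re.compile(r"(/\*\*\* .*? - offset )0x[0-9a-fA-F]+( \*\*\*/)")


def normalize(text):
    """Replace the numeric offset in table comments with a placeholder."""
    return OFFSET.sub(r"\1<n>\2", text)


def meaningful_diff(old_text, new_text):
    """Return only the +/- lines that remain after masking the offsets."""
    old_lines = normalize(old_text).splitlines()
    new_lines = normalize(new_text).splitlines()
    diff = difflib.unified_diff(old_lines, new_lines, lineterm="", n=0)
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```

If meaningful_diff() returns an empty list for the old and new output
files, the remaining differences are limited to the offset comments.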

The 2000 version is available in the .ucm format, so maybe converting
to that first would be a good preparatory patch:

https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm

Looking at its history, that .ucm file has seen small revisions, so it
may take some research to find the revision that is exactly equivalent
to the XML file we use. That will also tell us whether anything changes
for us besides the actual 2022 revision.
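
To illustrate the kind of equivalence check I mean, here is a rough
sketch that loads both formats into dictionaries and reports the
differences. It assumes the XML has entries of the form
<a u="00A4" b="A1 E8"/> and the .ucm has lines of the form
<U00A4> \xA1\xE8 |0. It ignores <range> elements and any non-roundtrip
(|1, |2, ...) entries, and all function names are hypothetical:

```python
import re

# Single-character entries in the XML file, e.g. <a u="00A4" b="A1 E8"/>
XML_ENTRY = re.compile(r'<a\s+u="([0-9A-Fa-f]+)"\s+b="([0-9A-Fa-f ]+)"')
# Mapping lines in the .ucm file, e.g. <U00A4> \xA1\xE8 |0
UCM_ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)"
)


def parse_xml(text):
    """Map code point -> encoded bytes for every <a u=... b=...> entry."""
    mapping = {}
    for m in XML_ENTRY.finditer(text):
        mapping[int(m.group(1), 16)] = bytes.fromhex(m.group(2).replace(" ", ""))
    return mapping


def parse_ucm(text):
    """Map code point -> encoded bytes for roundtrip (|0) entries only."""
    mapping = {}
    for line in text.splitlines():
        m = UCM_ENTRY.match(line.strip())
        if m and m.group(3) == "0":
            mapping[int(m.group(1), 16)] = bytes.fromhex(
                m.group(2).replace("\\x", "")
            )
    return mapping


def diff_mappings(old, new):
    """Return (added, removed, changed) lists of code points."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return added, removed, changed
```

Running diff_mappings(parse_xml(...), parse_ucm(...)) on our current XML
file and a candidate revision of the 2000 .ucm file should return three
empty lists if the two are equivalent for the single-character entries.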

-- 
John Naylor
Amazon Web Services
