Re: GB18030-2022 Support in PostgreSQL

Chao Li Mon, 11 Aug 2025 01:22:45 -0700

Hi John,

Thanks for your review.

Yes, I diffed 2000.ucm against 2022.ucm when I worked on the patch. The 
differences between the two files are quite small:

```diff - omit the comment part
> <U20AC> \x80 |3
> <U3000> \xA3\xA0 |3
> <UE5E5> \xA3\xA0 |4
>
28067a28099,28114
> <U9FB4> \xFE\x59 |0
> <U9FB4> \x82\x35\x90\x37 |3
> <U9FB5> \xFE\x61 |0
> <U9FB5> \x82\x35\x90\x38 |3
> <U9FB6> \xFE\x66 |0
> <U9FB6> \x82\x35\x90\x39 |3
> <U9FB7> \xFE\x67 |0
> <U9FB7> \x82\x35\x91\x30 |3
> <U9FB8> \xFE\x6D |0
> <U9FB8> \x82\x35\x91\x31 |3
> <U9FB9> \xFE\x7E |0
> <U9FB9> \x82\x35\x91\x32 |3
> <U9FBA> \xFE\x90 |0
> <U9FBA> \x82\x35\x91\x33 |3
> <U9FBB> \xFE\xA0 |0
> <U9FBB> \x82\x35\x91\x34 |3
29577c29624
< <UE5E5> \xA3\xA0 |0
---
> # <UE5E5> \xA3\xA0 |0
30001,30010c30048,30057
< <UE78D> \xA6\xD9 |0
< <UE78E> \xA6\xDA |0
< <UE78F> \xA6\xDB |0
< <UE790> \xA6\xDC |0
< <UE791> \xA6\xDD |0
< <UE792> \xA6\xDE |0
< <UE793> \xA6\xDF |0
< <UE794> \xA6\xEC |0
< <UE795> \xA6\xED |0
< <UE796> \xA6\xF3 |0
---
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
30146c30193
< <UE81E> \xFE\x59 |0
---
> <UE81E> \xFE\x59 |1
30154c30201
< <UE826> \xFE\x61 |0
---
> <UE826> \xFE\x61 |1
30159,30160c30206,30207
< <UE82B> \xFE\x66 |0
< <UE82C> \xFE\x67 |0
---
> <UE82B> \xFE\x66 |1
> <UE82C> \xFE\x67 |1
30166c30213
< <UE832> \xFE\x6D |0
---
> <UE832> \xFE\x6D |1
30183c30230
< <UE843> \xFE\x7E |0
---
> <UE843> \xFE\x7E |1
30200c30247
< <UE854> \xFE\x90 |0
---
> <UE854> \xFE\x90 |1
30216c30263
< <UE864> \xFE\xA0 |0
---
> <UE864> \xFE\xA0 |1
30470a30518,30537
> <UFE10> \xA6\xD9 |0
> <UFE10> \x84\x31\x82\x36 |3
> <UFE11> \xA6\xDB |0
> <UFE11> \x84\x31\x82\x37 |3
> <UFE12> \xA6\xDA |0
> <UFE12> \x84\x31\x82\x38 |3
> <UFE13> \xA6\xDC |0
> <UFE13> \x84\x31\x82\x39 |3
> <UFE14> \xA6\xDD |0
> <UFE14> \x84\x31\x83\x30 |3
> <UFE15> \xA6\xDE |0
> <UFE15> \x84\x31\x83\x31 |3
> <UFE16> \xA6\xDF |0
> <UFE16> \x84\x31\x83\x32 |3
> <UFE17> \xA6\xEC |0
> <UFE17> \x84\x31\x83\x33 |3
> <UFE18> \xA6\xED |0
> <UFE18> \x84\x31\x83\x34 |3
> <UFE19> \xA6\xF3 |0
> <UFE19> \x84\x31\x83\x35 |3
```

As you can see, the changes are limited to the 18 characters whose mappings 
changed, plus 3 other Unicode code points (U+20AC, U+3000, U+E5E5). My code 
comment in UCS_to_GB18030.pl explains these changes:

```code comment from UCS_to_GB18030.pl
# The |n is a flag, where n has values of 0, 1, 3, 4.
# With reference to
# https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132,
# the flag should mean the following:
#   0 - round-trip mapping
#   1 - there are 18 mappings with flag 1, those are mapping changes
#       from GB18030-2000 to GB18030-2022. Old mappings are marked
#       with flag 1, new mappings with flag 0. So we can ignore all
#       mappings with flag 1.
#   3 - there are 20 mappings with flag 3:
#         18 of them correspond to the 18 mappings with flag 1, and
#       represent the new GB18030-2022 mapping for the Unicode code
#       point of the old mapping.
#       These 18 new mappings have no actual glyphs in GB18030-2022.
#       So we can ignore these 18 mappings with flag 3.
#         The other 2 are: "<U20AC> \x80 |3" and "<U3000> \xA3\xA0 |3".
#       They are two reserved fallbacks for compatibility with GBK and
#       other web data as in WHATWG. Both U20AC and U3000 have round-
#       trip mappings in GB18030-2022, so we can ignore these two
#       mappings with flag 3.
#         So, we can ignore all mappings with flag 3.
#   4 - there is only one mapping with flag 4: <UE5E5> \xA3\xA0 |4.
#       This is a "good one-way" mapping from U+E5E5 to \xA3\xA0
#       for maximum compatibility with previous behavior. So we can
#       ignore this mapping as well.
```
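To make the flag rules above concrete, here is a rough sketch in Python of a filter that keeps only the flag-0 (round-trip) mappings. This is not the code in UCS_to_GB18030.pl, which is Perl; the names `UCM_LINE`, `parse_ucm` and `round_trip_only` are made up for illustration, and the sample lines are copied from the diff earlier in this mail.

```python
import re
from collections import Counter

# Matches ICU .ucm mapping lines such as: <U9FB4> \xFE\x59 |0
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)$"
)

def parse_ucm(lines):
    """Yield (code_point, gb18030_bytes, flag) for each mapping line."""
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if m is None:
            continue  # comment, header, or blank line
        code_point = int(m.group(1), 16)
        encoded = bytes.fromhex(m.group(2).replace("\\x", ""))
        yield code_point, encoded, int(m.group(3))

def round_trip_only(lines):
    """Keep flag-0 mappings only; flags 1, 3 and 4 are ignored."""
    return {cp: enc for cp, enc, flag in parse_ucm(lines) if flag == 0}

# Sample lines taken from the diff quoted in this thread.
SAMPLE = [
    r"<U20AC> \x80 |3",
    r"<U9FB4> \xFE\x59 |0",
    r"<U9FB4> \x82\x35\x90\x37 |3",
    r"# <UE5E5> \xA3\xA0 |0",
    r"<UE5E5> \xA3\xA0 |4",
    r"<UE81E> \xFE\x59 |1",
]

flag_counts = Counter(flag for _, _, flag in parse_ucm(SAMPLE))
kept = round_trip_only(SAMPLE)
print(flag_counts)
print({hex(cp): enc.hex() for cp, enc in kept.items()})
```

With this sample, only the U+9FB4 to 0xFE59 mapping survives the filter; the commented-out U+E5E5 line is skipped because it does not start with `<U`.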

For your question:

> "9 characters are no longer required by the new standard, but are
> retained in this patch for compatibility"
> 
> How is that done?

The 9 mappings are unchanged between 2000.ucm and 2022.ucm. For example, 
GB18030 code 0xFD9C is one of the 9 codes that are no longer required, but 
the mapping

<UF92C> \xFD\x9C |0

still appears in 2022.ucm, so the character is retained.
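
One way to double-check this kind of claim is to compare the mapping lines of the two files as sets. The snippet below is only a sketch under that assumption: `mappings`, `OLD_2000` and `NEW_2022` are hypothetical names, and the two lists are tiny excerpts built from lines quoted in this thread, not the full .ucm files.

```python
import re

# Matches ICU .ucm mapping lines such as: <UF92C> \xFD\x9C |0
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)$"
)

def mappings(lines):
    """Return the set of (code_point, bytes, flag) triples in .ucm lines."""
    found = set()
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if m:
            found.add((
                int(m.group(1), 16),
                bytes.fromhex(m.group(2).replace("\\x", "")),
                int(m.group(3)),
            ))
    return found

# Illustrative excerpts only, not the complete files.
OLD_2000 = [
    r"<UF92C> \xFD\x9C |0",
    r"<UE81E> \xFE\x59 |0",
]
NEW_2022 = [
    r"<UF92C> \xFD\x9C |0",
    r"<UE81E> \xFE\x59 |1",
    r"<U9FB4> \xFE\x59 |0",
]

old, new = mappings(OLD_2000), mappings(NEW_2022)
retained = old & new   # identical in both versions
changed = old ^ new    # present in only one version

print(sorted(retained))
print(sorted(changed))
```

In these excerpts the U+F92C line lands in `retained`, while the U+E81E and U+9FB4 lines land in `changed`, which matches what the diff shows.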

Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

> On Aug 11, 2025, at 13:50, John Naylor <[email protected]> wrote:
> 
> On Mon, Aug 11, 2025 at 9:01 AM Chao Li <[email protected]> wrote:
>> 
>> I have created a patch https://commitfest.postgresql.org/patch/5954/. 
>> CommitFests requested a rebase, so I rebased the code and created the v2 
>> patch.
>> 
>> BTW, I have tested all 66 new characters, 9 not-required characters and 18 
>> changed characters in a way as:
> 
> "9 characters are no longer required by the new standard, but are
> retained in this patch for compatibility"
> 
> How is that done?
> 
>> I added a test case with a mapping changed char, and the test passes:
>> 
>> % make check
>> ...
>> # All 229 tests passed.
>> 
>> For more details on the standard change, see 
>> https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
>> 
>> I am attaching the patch file.
> 
> Going from the old .xml file to the .ucm file makes it difficult to
> see the relevant changes. Also, there are nearly 1000 non-user-visible
> changes like this in the output file that are not explained:
> 
> -  /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
> +  /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/
> 
> The 2000 version is available in the .ucm format, so maybe converting
> to that first would be a good preparatory patch:
> 
> https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm
> 
> Looking at the history, it looks like that file has seen small
> revisions, so it may take some research to get the exact equivalent to
> the XML file we use. That will also tell us if anything will change
> for us besides the actual 2022 revision.
> 
> -- 
> John Naylor
> Amazon Web Services
