Hi John,

Thanks for your review.

Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The differences between 2000.ucm and 2022.ucm are quite small:

```diff
- omit the comment part
> <U20AC> \x80 |3
> <U3000> \xA3\xA0 |3
> <UE5E5> \xA3\xA0 |4
>
28067a28099,28114
> <U9FB4> \xFE\x59 |0
> <U9FB4> \x82\x35\x90\x37 |3
> <U9FB5> \xFE\x61 |0
> <U9FB5> \x82\x35\x90\x38 |3
> <U9FB6> \xFE\x66 |0
> <U9FB6> \x82\x35\x90\x39 |3
> <U9FB7> \xFE\x67 |0
> <U9FB7> \x82\x35\x91\x30 |3
> <U9FB8> \xFE\x6D |0
> <U9FB8> \x82\x35\x91\x31 |3
> <U9FB9> \xFE\x7E |0
> <U9FB9> \x82\x35\x91\x32 |3
> <U9FBA> \xFE\x90 |0
> <U9FBA> \x82\x35\x91\x33 |3
> <U9FBB> \xFE\xA0 |0
> <U9FBB> \x82\x35\x91\x34 |3
29577c29624
< <UE5E5> \xA3\xA0 |0
---
> # <UE5E5> \xA3\xA0 |0
30001,30010c30048,30057
< <UE78D> \xA6\xD9 |0
< <UE78E> \xA6\xDA |0
< <UE78F> \xA6\xDB |0
< <UE790> \xA6\xDC |0
< <UE791> \xA6\xDD |0
< <UE792> \xA6\xDE |0
< <UE793> \xA6\xDF |0
< <UE794> \xA6\xEC |0
< <UE795> \xA6\xED |0
< <UE796> \xA6\xF3 |0
---
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
30146c30193
< <UE81E> \xFE\x59 |0
---
> <UE81E> \xFE\x59 |1
30154c30201
< <UE826> \xFE\x61 |0
---
> <UE826> \xFE\x61 |1
30159,30160c30206,30207
< <UE82B> \xFE\x66 |0
< <UE82C> \xFE\x67 |0
---
> <UE82B> \xFE\x66 |1
> <UE82C> \xFE\x67 |1
30166c30213
< <UE832> \xFE\x6D |0
---
> <UE832> \xFE\x6D |1
30183c30230
< <UE843> \xFE\x7E |0
---
> <UE843> \xFE\x7E |1
30200c30247
< <UE854> \xFE\x90 |0
---
> <UE854> \xFE\x90 |1
30216c30263
< <UE864> \xFE\xA0 |0
---
> <UE864> \xFE\xA0 |1
30470a30518,30537
> <UFE10> \xA6\xD9 |0
> <UFE10> \x84\x31\x82\x36 |3
> <UFE11> \xA6\xDB |0
> <UFE11> \x84\x31\x82\x37 |3
> <UFE12> \xA6\xDA |0
> <UFE12> \x84\x31\x82\x38 |3
> <UFE13> \xA6\xDC |0
> <UFE13> \x84\x31\x82\x39 |3
> <UFE14> \xA6\xDD |0
> <UFE14> \x84\x31\x83\x30 |3
> <UFE15> \xA6\xDE |0
> <UFE15> \x84\x31\x83\x31 |3
> <UFE16> \xA6\xDF |0
> <UFE16> \x84\x31\x83\x32 |3
> <UFE17> \xA6\xEC |0
> <UFE17> \x84\x31\x83\x33 |3
> <UFE18> \xA6\xED |0
> <UFE18> \x84\x31\x83\x34 |3
> <UFE19> \xA6\xF3 |0
> <UFE19> \x84\x31\x83\x35 |3
```

As you can see, the changes only affect the 18 changed characters plus 3 other Unicode code points (U+20AC, U+3000, U+E5E5). My code comment in UCS_to_GB18030.pl explains these changes:

```perl
# The |n is a flag, where n has values of 0, 1, 3, 4.
# With reference to https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132,
# the flag should mean the following:
#   0 - round-trip mapping
#   1 - there are 18 mappings with flag 1, those are mapping changes
#       from GB18030-2000 to GB18030-2022. Old mappings are marked
#       with flag 1, new mappings with flag 0. So we can ignore all
#       mappings with flag 1.
#   3 - there are 20 mappings with flag 3:
#       18 of them correspond to the 18 mappings with flag 1, but give
#       the new GB18030-2022 mapping for the old mapping's Unicode
#       code point. These 18 new mappings have no actual glyphs in
#       GB18030-2022. So we can ignore these 18 mappings with flag 3.
#       The other 2 are: "<U20AC> \x80 |3" and "<U3000> \xA3\xA0 |3".
#       They are two reserved fallbacks for compatibility with GBK and
#       other web data as in WHATWG. Both U20AC and U3000 have round-
#       trip mappings in GB18030-2022, so we can ignore these two
#       mappings with flag 3.
#       So, we can ignore all mappings with flag 3.
#   4 - there is only one mapping with flag 4: <UE5E5> \xA3\xA0 |4.
#       This is a "good one-way" mapping from U+E5E5 to \xA3\xA0
#       for maximum compatibility with previous behavior. So we can
#       ignore this mapping as well.
```

For your question:

> "9 characters are no longer required by the new standard, but are
> retained in this patch for compatibility"
>
> How is that done?

The 9 mappings are not changed between 2000.ucm and 2022.ucm.
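
A quick way to double-check which round-trip mappings changed (and which did not) is to compare only the flag-0 entries of the two files. Below is a minimal sketch in Python. It is not part of the patch: the names `parse_ucm` and `changed_round_trips` are hypothetical, and the regular expression assumes every mapping line has exactly the `<UXXXX> \xHH... |n` shape shown in the diff, with anything else (headers, comments, blank lines) skipped.

```python
import re

# Assumed line shape, e.g. "<UF92C> \xFD\x9C |0"; real files may need a looser pattern.
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)\s*$"
)


def parse_ucm(lines):
    """Return {flag: {code_point: encoded_bytes}} for every mapping line."""
    tables = {}
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if not m:
            continue  # header, comment (e.g. "# <UE5E5> ..."), or blank line
        code_point = int(m.group(1), 16)
        encoded = bytes.fromhex(m.group(2).replace("\\x", ""))
        flag = int(m.group(3))
        tables.setdefault(flag, {})[code_point] = encoded
    return tables


def changed_round_trips(old_lines, new_lines):
    """Code points whose flag-0 (round-trip) mapping differs between two files."""
    old = parse_ucm(old_lines).get(0, {})
    new = parse_ucm(new_lines).get(0, {})
    return sorted(cp for cp in old.keys() | new.keys() if old.get(cp) != new.get(cp))


if __name__ == "__main__":
    # Tiny excerpts taken from the diff above, standing in for the full files.
    old = [r"<UE78D> \xA6\xD9 |0", r"<UF92C> \xFD\x9C |0"]
    new = [
        r"<UE78D> \xA6\xD9 |1",
        r"<UFE10> \xA6\xD9 |0",
        r"<UFE10> \x84\x31\x82\x36 |3",
        r"<UF92C> \xFD\x9C |0",
    ]
    print([f"U+{cp:04X}" for cp in changed_round_trips(old, new)])
    # prints ['U+E78D', 'U+FE10']; U+F92C is absent because its mapping is unchanged
```

To run it on the real files, pass `open("2000.ucm")` and `open("2022.ucm")` (or whatever the local file names are) in place of the two lists.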
For example, GB18030 code 0xFD9C is one of the 9 codes that are no longer required, but the mapping

<UF92C> \xFD\x9C |0

still appears in 2022.ucm, so this character is retained.

Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

> On Aug 11, 2025, at 13:50, John Naylor <johncnaylo...@gmail.com> wrote:
>
> On Mon, Aug 11, 2025 at 9:01 AM Chao Li <li.evan.c...@gmail.com> wrote:
>>
>> I have created a patch https://commitfest.postgresql.org/patch/5954/.
>> CommitFests requested a rebase, so I rebased the code and created the v2
>> patch.
>>
>> BTW, I have tested all 66 new characters, 9 not-required characters and 18
>> changed characters in a way as:
>
> "9 characters are no longer required by the new standard, but are
> retained in this patch for compatibility"
>
> How is that done?
>
>> I added a test case with a mapping changed char, and the test passes:
>>
>> % make check
>> ...
>> # All 229 tests passed.
>>
>> For more details on the standard change, see
>> https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
>>
>> I am attaching the patch file.
>
> Going from the old .xml file to the .ucm file makes it difficult to
> see the relevant changes. Also, there are nearly 1000 non-user-visible
> changes like this in the output file that are not explained:
>
> - /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
> + /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/
>
> The 2000 version is available in the .ucm format, so maybe converting
> to that first would be a good preparatory patch:
>
> https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm
>
> Looking at the history, it looks like that file has seen small
> revisions, so it may take some research to get the exact equivalent to
> the XML file we use. That will also tell us if anything will change
> for us besides the actual 2022 revision.
>
> --
> John Naylor
> Amazon Web Services