At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangot...@gmail.com> wrote in
> On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi
> <horikyota....@gmail.com> wrote:
> > So ping-pong between Unicode and SJIS behaves like this:
> >
> > U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
>
> Is it the following piece of code in UCS_TO_SJIS.pl that manually adds
> the mapping?

Yes.

> # Add these UTF8->SJIS pairs to the table.
> push @$mapping,
> ...
>   {
>     direction => FROM_UNICODE,
>     ucs => 0x2212,
>     code => 0x817c,
>     comment => '# MINUS SIGN',
>     f => $this_script,
>     l => __LINE__
>   },
>
> Given that U+2212 is encoded by e28892 in utf8, I assume that's how
> utf8_to_sjis.map ends up with the following mapping into sjis for that
> byte sequence:
>
> /*** Three byte table, leaf: e288xx - offset 0x004ee ***/
>
> /* 80 */ 0x81cd, 0x0000, 0x81dd, 0x81ce, 0x0000, 0x0000, 0x0000, 0x81de,
> /* 88 */ 0x81b8, 0x0000, 0x0000, 0x81b9, 0x0000, 0x0000, 0x0000, 0x0000,
> /* 90 */ 0x0000, 0x8794, "0x817c", ...

I'm not sure how we should construct our own mapping, but if we simply
moved to a mapping based on JIS0208.TXT(*1), as Ishii-san suggested, the
differences in the mapping would be as follows.

1. The following codes (regions) are not defined in JIS0208.

   8ea1 - 8edf      (up to 64 characters (I didn't actually count them.))
   ada1 - adfc      (up to 92 characters (ditto))
   8ff3f3 - 8ff4a8  (up to 182 characters (ditto))
   a1c0    ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
   8ff4aa  ff07: (ff07: FULLWIDTH APOSTROPHE)

2. Some individual differences

     EUC   0208  932
     a1c1  301c  ff5e: (301c: WAVE DASH)
     a1c2  2016  2225: (2016: DOUBLE VERTICAL LINE)
                     : (2225: PARALLEL TO)
   * a1dd  2212  ff0d: (2212: MINUS SIGN)
                     : (ff0d: FULLWIDTH HYPHEN-MINUS)
     d1f1  a2    ffe0: (00a2: CENT SIGN)
                     : (ffe0: FULLWIDTH CENT SIGN)
     d1f2  a3    ffe1: (00a3: POUND SIGN)
                     : (ffe1: FULLWIDTH POUND SIGN)
     a2cc  ac    ffe2: (00ac: NOT SIGN)
                     : (ffe2: FULLWIDTH NOT SIGN)

*1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT

> > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> >
> > I think we don't change authoritative mappings, but maybe can add some
> > one-way conversions for the convenience.
>
> Maybe UCS_TO_EUC_JP.pl could do something like the above.
>
> Are there other cases that were fixed like this in the past, either
> for euc_jp or sjis?

Honestly, I don't know how the mapping was decided in 2002, but
removing the regions in 1 would cause confusion.

So what we can do in this area would be changing some of the entries
in 2 to the 0208 mapping. But an arbitrary mixture of different mappings
would cause new problems.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
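
For illustration only, the ping-pong behaviour discussed in this thread
(U+2212 => 0x817c@sjis => U+ff0d) can be modelled with two tiny lookup
tables. This is a minimal sketch in Python, not the PostgreSQL
implementation: the table names SJIS_TO_UCS and UCS_TO_SJIS and the
helpers round_trip and one_way_sources are hypothetical, and the tables
contain only the code points mentioned above, whereas the real maps are
generated by UCS_TO_SJIS.pl.

```python
# Hypothetical hand-written tables; only the code points from this thread.

# SJIS -> Unicode: 0x817c decodes to FULLWIDTH HYPHEN-MINUS.
SJIS_TO_UCS = {0x817C: 0xFF0D}

# Unicode -> SJIS: the round-trip pair plus the manually added
# FROM_UNICODE-only pair for MINUS SIGN.
UCS_TO_SJIS = {
    0xFF0D: 0x817C,  # FULLWIDTH HYPHEN-MINUS (round-trips)
    0x2212: 0x817C,  # MINUS SIGN (one-way, Unicode -> SJIS only)
}


def round_trip(ucs):
    """Convert a Unicode code point to SJIS and back again."""
    return SJIS_TO_UCS[UCS_TO_SJIS[ucs]]


def one_way_sources(to_sjis, to_ucs):
    """Return the Unicode code points that do not survive a round trip."""
    return sorted(u for u, s in to_sjis.items() if to_ucs.get(s) != u)


if __name__ == "__main__":
    for ucs in (0x2212, 0xFF0D):
        print(f"U+{ucs:04x} => 0x{UCS_TO_SJIS[ucs]:04x}@sjis "
              f"=> U+{round_trip(ucs):04x}")
    print([f"U+{u:04x}" for u in one_way_sources(UCS_TO_SJIS, SJIS_TO_UCS)])
```

A check like one_way_sources could, under the same assumptions, be used
to list which entries in a candidate mapping are one-way conversions
before deciding whether to keep them.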