On 09/07/2016 09:50 AM, Kyotaro HORIGUCHI wrote:
Hi,
I found a useless entry in utf8_to_sjis.map:
{0xc19c, 0x815f},
which is apparently illegal as UTF-8, and which PostgreSQL
deliberately refuses. So it should be removed, and the attached
patch does that. 0x815f (SJIS) is also mapped from 0xefbcbc (U+FF3C
FULLWIDTH REVERSE SOLIDUS), and that is the right mapping.
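
For illustration, here is a small Python check (not part of the patch) of why 0xc19c is rejected. The byte 0xc1 is never a valid UTF-8 lead byte: read naively as a two-byte sequence, 0xc1 0x9c would be an overlong encoding of U+005C (REVERSE SOLIDUS). By contrast, 0xefbcbc decodes cleanly to U+FF3C. The helper names are hypothetical.

```python
def naive_two_byte_decode(b: bytes) -> int:
    """Extract the payload bits of a 2-byte UTF-8 pattern, with no validity checks."""
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)


def is_valid_utf8(b: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


bogus = bytes.fromhex("c19c")   # the entry removed from utf8_to_sjis.map
good = bytes.fromhex("efbcbc")  # the remaining source for SJIS 0x815f

# The payload bits of the bogus sequence spell out an ASCII code point,
# which is exactly what makes the encoding overlong.
print(hex(naive_two_byte_decode(bogus)))
print(is_valid_utf8(bogus))
print(is_valid_utf8(good), hex(ord(good.decode("utf-8"))))
```

A strict decoder must refuse such sequences, so a map entry keyed on 0xc19c can never match valid input.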
Yes, I think you're right. Committed, thanks!
By the way, the file comment at the beginning of UCS_to_SJIS.pl
is the following.
# Generate UTF-8 <--> SJIS code conversion tables from
# map files provided by Unicode organization.
# Unfortunately it is prohibited by the organization
# to distribute the map files. So if you try to use this script,
# you have to obtain SHIFTJIS.TXT from
# the organization's ftp site.
The file was found at the following place thanks to google.
ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/
As the URL shows, and as written in the file
Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete and
the *live* definition *may* be found in the Unicode Character
Database. But I haven't found SJIS-related information there.
If I'm not missing anything, the only available authority would
be JIS X 0208/0213, but what should be implemented seems to be a
possibly modified MS932, for which I don't know the authority.
Anyway, I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and got
mapping files quite different from the current ones.
So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
maintained. If no authoritative information is available, the
generating script is no longer usable. If some other authority is
chosen, the script has to be modified to match whatever the new
source format is.
The script is clearly intended to read CP932.TXT, rather than
SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
However, running the script with that doesn't produce exactly what we
have in utf8_to_sjis.map, either. It's otherwise the same, but we have some
extra mappings:
- {0xc2a5, 0x5c},
- {0xc2ac, 0x81ca},
- {0xe28096, 0x8161},
- {0xe280be, 0x7e},
- {0xe28892, 0x817c},
- {0xe3809c, 0x8160},
Those mappings were added in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
that commit, as well as a few valid mappings that UCS_to_SJIS.pl also produces.
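
To make the extra entries easier to judge, the following Python sketch decodes each UTF-8 key back to its Unicode code point and character name. It is only an illustration and assumes that the map keys are the UTF-8 bytes read as a big-endian integer, as the entries above suggest. The function names are hypothetical.

```python
import unicodedata

# The six extra entries from utf8_to_sjis.map: (UTF-8 key, SJIS code).
EXTRA_MAPPINGS = [
    (0xC2A5, 0x5C),
    (0xC2AC, 0x81CA),
    (0xE28096, 0x8161),
    (0xE280BE, 0x7E),
    (0xE28892, 0x817C),
    (0xE3809C, 0x8160),
]


def utf8_key_to_char(key: int) -> str:
    """Turn a map key (UTF-8 bytes packed into an integer) into a character."""
    length = (key.bit_length() + 7) // 8
    return key.to_bytes(length, "big").decode("utf-8")


def describe(key: int, sjis: int) -> str:
    """Format one entry together with its code point and Unicode name."""
    ch = utf8_key_to_char(key)
    return f"{{0x{key:x}, 0x{sjis:x}}}  U+{ord(ch):04X} {unicodedata.name(ch)}"


for key, sjis in EXTRA_MAPPINGS:
    print(describe(key, sjis))
```

The keys decode to U+00A5 YEN SIGN, U+00AC NOT SIGN, U+2016 DOUBLE VERTICAL LINE, U+203E OVERLINE, U+2212 MINUS SIGN and U+301C WAVE DASH.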
I can't judge if those mappings make sense. If we can't find an
authoritative source for them, I suggest that we leave them as they are,
but also hard-code them in UCS_to_SJIS.pl, so that running that script
produces those mappings in utf8_to_sjis.map, even though they are not
present in the CP932.TXT source file.
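
As a rough sketch of that idea (in Python for brevity, not the actual Perl of UCS_to_SJIS.pl), a generator could parse the source file and then merge in a hard-coded list. The sample text below is a short excerpt in the style of CP932.TXT, included only for illustration. The function names and the output format are assumptions, not what the real script does.

```python
# Excerpt in the style of CP932.TXT: SJIS code, Unicode code point, comment.
SAMPLE_CP932_TXT = """\
#    Name:     cp932 to Unicode table (excerpt, for illustration only)
0x5C\t0x005C\t#REVERSE SOLIDUS
0x7E\t0x007E\t#TILDE
0x80\t      \t#UNDEFINED
0x815F\t0xFF3C\t#FULLWIDTH REVERSE SOLIDUS
"""

# Hard-coded extras, keyed by UTF-8 bytes packed into an integer.
EXTRA_UTF8_TO_SJIS = {
    0xC2A5: 0x5C,
    0xC2AC: 0x81CA,
    0xE28096: 0x8161,
    0xE280BE: 0x7E,
    0xE28892: 0x817C,
    0xE3809C: 0x8160,
}


def utf8_key(code_point: int) -> int:
    """Pack the UTF-8 encoding of a code point into a big-endian integer."""
    return int.from_bytes(chr(code_point).encode("utf-8"), "big")


def parse_mapping_txt(text: str) -> dict:
    """Return {unicode_code_point: sjis_code}, skipping comments and undefined rows."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment-only line or a row with no Unicode column
        mapping[int(fields[1], 16)] = int(fields[0], 16)
    return mapping


def build_utf8_to_sjis(text: str, extras: dict) -> list:
    """Merge parsed mappings with the hard-coded extras and format the entries."""
    table = {utf8_key(ucs): sjis for ucs, sjis in parse_mapping_txt(text).items()}
    for key, sjis in extras.items():
        table.setdefault(key, sjis)  # the source file wins if it defines the key
    return [f"{{0x{key:x}, 0x{sjis:x}}}," for key, sjis in sorted(table.items())]


for entry in build_utf8_to_sjis(SAMPLE_CP932_TXT, EXTRA_UTF8_TO_SJIS):
    print(entry)
```

Using `setdefault` means a mapping from the source file takes precedence over a hard-coded one with the same key. Whether the extras should instead override the source is a design choice this sketch does not settle.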
- Heikki