> When I point my browser at
> file:///home/tmunro/projects/postgresql/build/doc/src/sgml/html/multibyte.html
> I see these longer descriptions flowing onto multiple lines making the
> table cells higher, while the published documentation[1] does only a
> small amount of that, and then the font instead becomes smaller as I
> make the window narrower. Is there an easy way to see the final
> website form in a local build?
Same here. It would be nice to know how to see the website form in a
local build.
> We'd have more free space in the affected rows if we did s/Extended
> UNIX Code-JP/EUC-JP/. Why is that acronym expanded, while ISO, ECMA,
> JIS and CP are not?
Fair point.
> It might be confusing that the style "ISO 8859-1, ECMA 94" is used to
> list alternative encoding standards that are aligned or equivalent,
> while here you're listing the encoding and then the underlying
> character sets in the same way. Would it be better to put them in
> parentheses?
>
> With those two changes we'd have:
>
> EUC_JP | EUC-JP (JIS X 0201, JIS X 0208, JIS X 0212)
> EUC_JIS_2004 | EUC-JP (JIS X 0201, JIS X 0213)
Looks good to me.
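To illustrate why listing the underlying character sets is useful, here
is a small sketch showing that one EUC-JP byte stream mixes several of
them, each with its own byte layout. It uses Python's built-in "euc_jp"
codec only as a stand-in; that it agrees with PostgreSQL's EUC_JP
conversion tables is an assumption, and the sample characters are my own
choice.

```python
# Illustration only: Python's "euc_jp" codec stands in for PostgreSQL's
# EUC_JP here, which is an assumption, not something checked against
# the PostgreSQL conversion maps.
samples = {
    "JIS X 0208 (hiragana a)": "\u3042",
    "JIS X 0201 (half-width katakana a)": "\uff71",
    "JIS X 0212 (kanji U+4E02)": "\u4e02",
}

# Encode each sample character and keep the raw bytes for inspection.
encoded = {label: ch.encode("euc_jp") for label, ch in samples.items()}

for label, raw in encoded.items():
    # JIS X 0208 uses two bytes; JIS X 0201 kana are prefixed with 0x8e
    # (SS2); JIS X 0212 characters are prefixed with 0x8f (SS3).
    print(f"{label}: {raw.hex()} ({len(raw)} bytes)")
```

The prefix bytes are what tell a decoder which character set the
following bytes belong to.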
> While wondering if some other rows could be more specific, I noticed
> that for GBK we have "Extended National Standard". I don't understand
> these things,
Me neither. "Extended National Standard" probably comes from the fact
that GB means "national standard" and "K" means "extension". However,
GBK is not actually an "official standard" that Chinese industries are
required to follow [3]. It is more of a strongly recommended standard.
Perhaps we can just write "De facto standard (CP936)".
> but from a quick look at Wikipedia[2], I got the idea
> that if convert_to('€', 'GBK') = '\x80'::bytea (yes) then what we have
> might actually be the yet-further-extended standard known as "GBK
> 1.0". Do I have that right?
I don't think so. [2] states that "Microsoft later added the euro sign
to Code page 936 and assigned the code 0x80 to it. This is not a valid
code point in GBK 1.0." So what we have seems to be CP936. In fact,
UCS_to_most.pl, which is used to generate gbk_to_utf8.map, has the
line:
'GBK' => 'CP936.TXT');
> As for BIG5, it seems to be an underspecified mess defying description
> other than "good luck" :-)
Yeah, ours is BIG5 (Unicode 1.1) + CP950.
> Thankfully we won't have to list all the
> standards that MULE_INTERNAL indirectly covers, as it looks like we've
> agreed to drop it. And IIRC there was a thread somewhere proposing to
> drop JOHAB...
Apparently JOHAB has not been well tested...
> Makes sense to me. The underlying character sets must be very
> important to understand, especially if implementations vary on these
> points. We should give the information.
Yes.
> . o O ( I wonder if anyone has ever tried to make an "XTF-8-JA"
> encoding just like UTF-8 but with ~1900 high-frequency Japanese
> codepoints swapped into the 2-byte range U+0080-07ff where Greek,
> Hebrew, Arabic and others won the encoding lottery. UTF-16 is
> apparently sometimes preferred to save space in other RDBMSs that can
> do it, but I suppose you could achieve the same size most of the time
> with a scheme like that. The other encodings have the desired size,
> but non-universal character sets. A similar thought for the languages
> of India, but with the frequency fuzziness factor removed: you could
> surely map a dozen tiny non-ideographic scripts into that range to
> save a byte per character... Hindi, Tamil etc didn't get a very good
> deal with UTF-8. Don't worry, I'm not suggesting that PostgreSQL has
> any business inventings its own hair-brained encodings, I'm just
> wondering out loud if that is a kind of thing that exists somewhere
> out there... )
Well, I think inventing an internal-use-only encoding is not a bad
thing in general. We already have a number of internal-only data
structures; internal encodings are just one more of them. (I am not
saying I want to implement "XTF-8-JA", though.)
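For reference, the size argument in the quoted musing can be checked
with a few lines of Python. This is only a sketch: the sample characters
are my own picks, and it measures standard UTF-8 and UTF-16, not any
hypothetical "XTF-8-JA".

```python
# UTF-8 encodes exactly the code points U+0080..U+07FF in two bytes.
TWO_BYTE_FIRST, TWO_BYTE_LAST = 0x0080, 0x07FF
two_byte_slots = TWO_BYTE_LAST - TWO_BYTE_FIRST + 1


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Sample characters chosen for illustration only.
samples = {
    "Greek alpha": "\u03b1",
    "Hiragana a": "\u3042",
    "Devanagari a": "\u0905",
    "Tamil a": "\u0b85",
}
sizes = {name: encoded_sizes(ch) for name, ch in samples.items()}

print("two-byte slots:", two_byte_slots)
for name, (utf8_len, utf16_len) in sizes.items():
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

The two-byte range holds 1920 code points, which matches the "~1900"
figure above, and the Japanese and Indic samples each cost one byte more
in UTF-8 than in UTF-16.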
> [1] https://www.postgresql.org/docs/current/multibyte.html
> [2] https://en.wikipedia.org/wiki/GBK_(character_encoding)
>
[3] https://ja.wikipedia.org/wiki/GBK
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp