[Encode/ISO-2022] KR is done. CN to go.

Dan Kogai Wed, 27 Mar 2002 16:11:41 -0800

Jungshik,

   First, thank you so much (as much as the number of code points in all
Korean charsets combined!) for submitting a patch so quickly.  It
applied cleanly.
   I am now hopeful that 1.00 will ship within the next 24 hours.
Coincidentally, it is 09:00 JST, meaning 00:00 Zulu.

On Thursday, March 28, 2002, at 08:01 , Jungshik Shin wrote:
>   Yeah, that's a common mistake made by (Japanese) programmers when
> they didn't bother to read RFC 1557 (or Ken Lunde's book) :-)

   I wonder where that 2022.enc came from.  Has nobody touched the *.euc
files that came from Tcl?  If so, NI-S, you should issue a warning to the
Tcl/Tk community!

>   You're very welcome :-) But 2022-kr.enc is not used any more,
> is it? I patched lib/Encode/KR/2022_KR.pm  instead.  It's not perfect
> for encoding, but nobody really needs it any more...  ISO-2022-KR decoder
> is still of use because there are some old emails floating around in
> people's mailboxes and some (outdated) programs still generate it.
> (this is why Mozilla has ISO-2022-KR decoder but doesn't have the 
> encoder)
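
To make the decoder side concrete, here is a minimal Python sketch (Python's
stock iso2022_kr codec stands in for Encode; the helper name to_iso2022_kr
is made up for illustration).  It assumes the RFC 1557 layout: a one-time
ESC $ ) C designator, then SO/SI around 7-bit KS C 5601 codes.

```python
ESC_HEADER = b"\x1b$)C"     # designates KS C 5601 to G1 (RFC 1557)
SO, SI = b"\x0e", b"\x0f"   # shift out to Korean / shift in to ASCII

def to_iso2022_kr(euc_bytes: bytes) -> bytes:
    """Wrap pure-Korean EUC-KR bytes as a 7-bit ISO-2022-KR stream.

    Hypothetical helper: it handles only an all-Korean run, with no
    ASCII mixed in and no line handling.
    """
    seven_bit = bytes(b & 0x7F for b in euc_bytes)
    return ESC_HEADER + SO + seven_bit + SI

raw = to_iso2022_kr("한글".encode("euc_kr"))
decoded = raw.decode("iso2022_kr")
print(raw, decoded)
```

The decoder only has to follow the shifts it is given; it never has to
decide anything.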

   For ISO-2022 in general, the decoder is easy but the encoder is a pain
in the neck.  The problem is where to insert the correct escape sequence,
and to do so you have to know which character set you want to
designate.  And as we know very well, a Unicode character by itself tells
nothing of its origin.  This is the very reason I am abstaining from
implementing ISO-2022-JP-2, which has to handle JIS X 0208, JIS X 0212,
GB2312, and KS C 5601 together.  Decoding to UTF-8 is easy via EUC-X (I
have JP, KR, and CN already).  But to encode back from UTF-8, you need to
somehow tell which charset each character belongs to, and Han
unification makes that impossible.  At the very least, a round trip is
impossible.  You need a database that says which charsets, if any, have
a code point for a given Unicode character, then give the charsets a
precedence and pick accordingly.  Since this is JP-2 we are talking about,
I would try JIS X first, then GB, then KS C, or something like that....
   Fortunately (at least for me;  (in)?famous morta-san may think
otherwise), ISO-2022-JP-2 is not prevalent yet, but a quick glance at
Google finds several references to =?ISO-2022-JP2?b...,  obviously from ML
archives.  So it is still in use, unlike ISO-2022-KR.
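
The precedence idea can be sketched roughly as follows.  This is Python, not
Encode; the name pick_charset and the PRECEDENCE tuple are assumptions for
illustration, and Python's EUC codecs stand in for the individual charsets
(euc_jp covers both JIS X 0208 and JIS X 0212).

```python
from typing import Optional

# Assumed precedence for a JP-2 style encoder: Japanese, then GB, then KS C.
PRECEDENCE = ("euc_jp", "gb2312", "euc_kr")

def pick_charset(ch: str) -> Optional[str]:
    """Return the first codec in PRECEDENCE that can encode ch, else None."""
    for codec in PRECEDENCE:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        return codec
    return None

# This character arrives as GB 2312 bytes, but once decoded to Unicode
# nothing records where it came from.
han = b"\xd6\xd0".decode("gb2312")
print(pick_charset(han))    # chosen by precedence, not by origin
print(pick_charset("한"))    # Hangul is only reachable via the Korean set
```

A character that started life in GB 2312 comes back out under the Japanese
designation, which is exactly why the round trip cannot be guaranteed.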

>   ISO-2022-KR is very rarely used these days. It MUST NOT be
> used for outgoing messages any more. However, the decoder is still handy
> to have (see above.)

   You capped MUST NOT.  Not even *deprecated*.  Is this de facto or de
jure?

>    One (rather drastic) way to reduce the number of spam mails
> is to just filter out email messages with MIME charset 'ks_c_5601-1987'
> and C-T 'text/html'.

   Well, a moderate amount of spam is okay with me;  I even enjoy it
sometimes, and it was useful in the course of forging Encode :)

>  Spammers are much more likely to use non-standard
> and broken mail programs than non-spammers (at least in Korea).

   Glad to hear that.  What is the socially accepted way to include
Korean text in a MIME header?  Is =?euc-kr?b... good enough?  Or do you
guys prefer quoted-printable?  Or is Korea so far into the future that
=?UTF-8?b... is the standard :?
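
For reference, the two "b" variants in question look like this on the wire.
A minimal Python sketch, assuming the RFC 2047 encoded-word layout
=?charset?b?base64?= ; the helper b_encoded_word is hypothetical and builds
the word by hand instead of relying on any mail library's defaults.

```python
import base64
from email.header import decode_header

def b_encoded_word(text: str, charset: str) -> str:
    """Build an RFC 2047 'B' encoded-word by hand (no line folding)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?b?{payload}?="

for charset in ("euc-kr", "utf-8"):
    word = b_encoded_word("한글", charset)
    raw, declared = decode_header(word)[0]   # (bytes, charset name)
    print(word, "->", raw.decode(declared))
```

Which of the two is preferred in practice is a question of convention that
the code cannot answer.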

>   In case of ISO-2022-KR, you could have used 'ksc5601-raw' just like
> HZ.pm uses 'gb2312-raw'.  That's not the case in ISO-2022-CN encoding,
> though. For ISO-2022-CN decoding, I believe you can still go without
> mock encoding but can use cns11643p1 and cns11643p2 along with 
> gb2312-raw.

   Right.  As far as decoding to UTF-8 is concerned, you don't need EUC
so long as you have raw encodings.  Maybe I was too obsessed with the
idea of bidirectionality.  Your words are a relief.
But I still feel somewhat arrogant leaving the door half-open, or in
this particular case, a trap door.
   If I were a die-hard Unicode activist, I would have made only decode()
available and forced UTF-8 for all output :)
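
The relation between the raw and EUC forms is just the high bit, as the
Python sketch below tries to show.  Python ships no "-raw" codecs, so this
illustration goes through the EUC codec (the "mock encoding" route); the
helper names raw_to_euc and decode_raw are assumptions.

```python
def raw_to_euc(raw7: bytes) -> bytes:
    """Set the high bit on each 7-bit raw byte to get the EUC form."""
    return bytes(b | 0x80 for b in raw7)

def decode_raw(raw7: bytes, euc_codec: str) -> str:
    """Decode a run of 7-bit double-byte codes by way of an EUC codec."""
    return raw_to_euc(raw7).decode(euc_codec)

# 7-bit GB 2312 codes, as they would sit between shift sequences.
raw_gb = bytes(b & 0x7F for b in "中文".encode("gb2312"))
print(raw_gb, decode_raw(raw_gb, "gb2312"))

# HZ wraps the same 7-bit codes in ~{ ... ~}; Python has an "hz" codec.
print((b"~{" + raw_gb + b"~}").decode("hz"))
```

With a true raw table the OR with 0x80 disappears, which is all that going
without EUC buys on the decoding side.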

>   Has ISO-2022-CN ever been used for email exchanges? The lead
> engineer of Pine dev. team at U. of Washington (whose name is escaping
> me at the moment) and one of the authors of RFC 1922 once wrote that he
> had received a handful of emails in ISO-2022-CN, but I have yet
> to receive a single message in ISO-2022-CN with both GB2312
> and CNS 11643-[12].

   I have no idea either.  Let's wait for Autrijus on this....

>   Alternative way to deal with it at the moment is just support US-ASCII
> and GB2312.

   That I have done.  But it is too well documented in the RFC, and there
is no such thing as ISO-2022-CN-0, or a souped-down version thereof.

>   Pls, take a look at my patch for ISO-2022-KR and modify it as you see 
> fit.
> (I haven't set up my perl-testing env. yet so that I didn't test it).

   I have.  More test data would also be welcome.  See t/*.euc and
t/*.ref.  t/(JP|KR).t does a round-trip matching test to check that
everything is okay.
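
The shape of such a round-trip check, as a rough Python sketch (this is not
the actual t/*.t code; round_trips and the in-memory samples are stand-ins
for the real test files):

```python
def round_trips(data: bytes, codec: str) -> bool:
    """Decode legacy bytes to Unicode, encode back, and compare."""
    return data.decode(codec).encode(codec) == data

# Stand-in samples; the real test data lives in files such as t/*.euc.
samples = {
    "euc_jp": "日本語のテキスト".encode("euc_jp"),
    "euc_kr": "한국어 텍스트".encode("euc_kr"),
}
for codec, data in samples.items():
    print(codec, round_trips(data, codec))
```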

   Anyway, Kamsahamnida!

Dan the Encode Maintainer
