On Monday, April 15, 2002, at 07:29, Nick Ing-Simmons wrote:
> I tracked down the "problem" tkmail was/is having with iso-2022-jp.
> The snag is I am using the API the way I designed it, not the way
> it is reliably implemented.
>
> When called thus:
>
> my $decoded = $enc->decode($encoded,1);
>
> decode is supposed to return portion it can decode, and set $encoded
> to what remains.

Ah, I see.  But it is a pain in the arse for "doubly-encoded" encodings 
like ISO-2022-JP.
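
For a single-step encoding, the contract Nick describes can be sketched as below.  This is only an illustration under assumptions: it uses the Encode::FB_QUIET check value found in released versions of Encode rather than the literal 1 in the quoted call, and plain ASCII with one invalid byte stands in for a real multi-byte stream.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Plain ASCII keeps the example simple: byte 0xFF is not valid ASCII.
my $enc     = find_encoding('ascii');
my $encoded = "abc\xFFdef";

# FB_QUIET: stop quietly at the first undecodable byte instead of
# dying or substituting a replacement character.
my $decoded = $enc->decode($encoded, Encode::FB_QUIET);

# $decoded holds the part that decoded cleanly; $encoded has been
# overwritten in place with the unprocessed rest, starting at the bad byte.
printf "decoded=%s, bytes left over=%d\n", $decoded, length $encoded;
```

The caller can then inspect or skip the leftover bytes and call decode again, which is exactly the step that becomes awkward once the decoder is stateful.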

Here is the problem.  As you see, to decode ISO-2022-JP, we first have 
to decode it into EUC-JP.  And ISO-2022-JP -> EUC-JP is treated (and 
should be treated) purely as a CES (character encoding scheme), so there 
is no chance of error unless there is a bogus escape sequence.  However, 
errors may arise when you try to convert the resulting EUC-JP stream to 
UTF-8.

The problem is that not all of the 94x94 = 8836 possible code points in 
JIS X 0208 and JIS X 0212 are actually used: only 6884 are used in 0208 
and 6072 in 0212.  So the remainder won't map to Unicode.
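
To put numbers on that gap, here is a small calculation using only the figures quoted above (the hash keys are just labels chosen for this sketch):

```perl
use strict;
use warnings;

my $cells = 94 * 94;    # cells in one 94x94 coded character set
my %used  = (jisx0208 => 6884, jisx0212 => 6072);    # figures from the text

# Cells with no assigned character, and hence no Unicode mapping.
my %unmapped = map { $_ => $cells - $used{$_} } keys %used;

for my $set (sort keys %unmapped) {
    printf "%s: %d of %d cells have no mapping\n",
        $set, $unmapped{$set}, $cells;
}
```

Any two-byte sequence that lands in one of those cells passes the ISO-2022-JP -> EUC-JP step untouched and only fails at the EUC-JP -> UTF-8 step.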

It was possible to use jis02*-raw instead of EUC-JP, but that 
implementation was too slow because you have to invoke encode() chunk by 
chunk.  In fact I tried it and it was 3 times as slow.

And the very sense of "what remains" gets moot when it comes to 
ISO-2022.  Suppose you got a string like this:

abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu....
                        ^^ error occurs here.

What's the remaining stream?

ghijklmn<ESC-to-ascii>opqrstu....


is WRONG because we are now in a jis0208 chunk and the escape sequence 
has already been stripped.  Do we have to go like

<ESC-to-jis0208>ghijklmn<ESC-to-ascii>opqrstu....

instead?  That slows down the encoder too much.  I just woke up.  Let me 
think about this a little bit more....
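
The re-prefixing idea above can be sketched as follows.  This is a rough illustration, not how Encode does or should do it: remainder_with_state is a hypothetical helper, the error offset is assumed to be known, and only four of the ISO-2022-JP escape sequences are recognised.

```perl
use strict;
use warnings;

# Escape sequences recognised by this sketch (a subset of ISO-2022-JP):
# ESC ( B = ASCII, ESC ( J = JIS X 0201 Roman,
# ESC $ @ = JIS C 6226-1978, ESC $ B = JIS X 0208-1983.
my $ESC_RE = qr/\e(?:\(B|\(J|\$\@|\$B)/;
my $ASCII  = "\e(B";

# Hypothetical helper: return the unprocessed tail of $stream starting at
# byte offset $err_pos, prefixed with whichever escape was in force there.
sub remainder_with_state {
    my ($stream, $err_pos) = @_;
    my $active = $ASCII;                 # ISO-2022-JP starts out in ASCII
    my $head   = substr($stream, 0, $err_pos);
    while ($head =~ /($ESC_RE)/g) {
        $active = $1;                    # last escape before the error wins
    }
    my $tail = substr($stream, $err_pos);
    return $active eq $ASCII ? $tail : $active . $tail;
}

# The example stream from above, with real escape sequences filled in.
my $stream = "abcd\e\$B" . "cdefghijklmn" . "\e(Bopqrstu";

# "abcd" is 4 bytes, the escape is 3, and "cdef" is 4, so "gh" is at 11.
my $rest = remainder_with_state($stream, 4 + 3 + 4);
```

Rescanning the head on every error is the expensive part here; a decoder that carried the active escape along as state would avoid the rescan but would still have to rebuild the remainder string.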

Dan the Encode Maintainer
