I changed the subject to be more appropriate.

On 2002.01.30, at 01:27, Nick Ing-Simmons wrote:

> In an ideal world CGI.pm (which is also bundled with perl these days)
> will have done any _utf8_on() magic that is required - usually by
> looking at the charset attribute of the media type and then calling
> Encode::decode() to convert data into perl's internal form. Likewise
> other CGI assist modules should do likewise - what they need is a
> well defined Encode module that allows them to do what the standards
> say without having to re-invent everything themselves.
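To make that concrete, here is roughly what I imagine such a module
doing - a minimal sketch, not anything CGI.pm actually does today, and
I am assuming the charset has already been fished out of the request's
Content-Type by some other means:

    use CGI;
    use Encode qw(decode);

    my $q = CGI->new;

    # Assumed, not real: CGI.pm does not extract this for us yet,
    # so pretend we pulled it from "...; charset=EUC-JP" ourselves.
    my $charset = 'euc-jp';

    # Convert each parameter from raw octets into perl's internal
    # form, so the rest of the script sees character strings.
    my %param;
    for my $name ($q->param) {
        $param{$name} = decode($charset, scalar $q->param($name));
    }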
It is up to Lincoln Stein to decide which way to go, but the demand to
keep CGI.pm compatible with older versions of Perl is so high that we
should not count on that. Alternatively, one could implement a more
modern version under a different namespace, as Lincoln himself admits.

> So the CGI scripter just has to work with perl's strings (encoded as
> perl sees fit), and then just "hint" (if necessary) to the CGI module
> how it should be encoded for transport back. I would expect the
> CGI.pm code to make sensible choices without hints in most cases -
> e.g. reply in the same encoding as the request was received in.

That "same encoding" is somewhat problematic, especially when Japanese
is involved. One problem that a later version of CGI.pm caused was
exactly that (I forget which version it was). Before the change, the
charset= part of Content-Type: was not sent, so it was up to the HTML
body to tell the browser which charset to use. Now charset="ISO-8859-1"
is appended by default, while users of CGI.pm keep sending Shift JIS,
EUC or ISO-2022-JP. (A workaround is sketched in the P.S. below.)

Actually, the charset situation in Japan has gotten even more
complicated since NTT DOCOMO introduced the (in)famous i-Mode. Not only
does i-Mode use Shift JIS (the most popular yet most problematic
charset used in Japan), they have also added their own extensions
(mostly dingbats that are used like icons). Oh well....

> But we cannot do this yet as Encode does not really support some key
> MIME charsets - notably the iso2022 family of escape encodings.
> I don't have the standards - they are paper copy things one buys for
> Yen (cannot find Yen sign on this keyboard) and may understandably be
> written in Japanese - which I cannot read, nor do I have any test
> data. (Other than the piles of assumed-Chinese SPAM that I seem to
> accumulate - but I don't know that is "valid".)

Right. We need more testers for that. Japanese charsets I know, but I
don't know much about the others.

> That is the ideal - well formed HTTP requests. We also need to handle
> legacy stuff and "guess" appropriately. But it seems to me that until
> we have a solution (with acceptable performance) to the well formed
> case, it is pointless to worry about the "guess" case.

I think the "guess" case is needed only for Japanese. The situation for
the other CJK languages is not this complicated; it is usually "legacy
+ UTF-8" (that is, GB2312 or UTF-8 for Simplified Chinese, for
instance).

> I will gladly re-word the sections you consider misleading and
> check and correct if necessary the ones you consider false.
> Can you give me a list as you spot them?

I will.

> So long as you check your facts as you go that is a welcome
> contribution. But do not for example suggest "you should always do
> _utf8_on() before calling encode()" because it isn't true.

No, I won't, but at the same time I still don't know when to and when
not to. I think we need more working examples before we settle on an
idiom (the P.P.S. below takes a first stab)....

Dan
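P.S. Regarding the charset="ISO-8859-1" default I mentioned above: the
immediate workaround is to name the charset explicitly when emitting
the header, since header() accepts a -charset parameter. A sketch,
assuming the body really is EUC-JP:

    use CGI;

    my $q = CGI->new;

    # Overrides the ISO-8859-1 default; sends
    #   Content-Type: text/html; charset=EUC-JP
    print $q->header(-type => 'text/html', -charset => 'EUC-JP');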
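P.P.S. A first stab at the idiom, for discussion: decode() from the
known charset on the way in, encode() to the target charset on the way
out, and leave _utf8_on() alone unless the octets are already known to
be valid UTF-8. The sample bytes below are assumed Shift JIS, and
Encode must have its Japanese encodings available:

    use Encode qw(decode encode);

    my $octets = "\x93\xFA\x96\x7B";          # "Nihon" in raw Shift JIS

    # Octets -> characters: decode FROM the charset they are in.
    my $string = decode('shiftjis', $octets);

    # Characters -> octets: encode TO whatever we must emit.
    my $out = encode('euc-jp', $string);

    # Encode::_utf8_on($octets) merely flips the flag without any
    # validation, so it is only safe on known-valid UTF-8 octets.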