I changed the subject to be more appropriate.

On 2002.01.30, at 01:27, Nick Ing-Simmons wrote:

> In an ideal world CGI.pm (which is also bundled with perl these days)
> will have done any _utf8_on() magic that is required - usually by
> looking at the charset attribute of the media type and then calling
> Encode::decode() to convert data into perl's internal form. Likewise
> other CGI assist modules should do likewise - what they need is a
> well defined Encode module that allows them to do what the standards
> say without having to re-invent everything themselves.
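To make that concrete, here is roughly what I imagine such a module
doing - a minimal sketch, not anything CGI.pm actually does today, and
I am assuming the charset has already been fished out of the request's
Content-Type by some other means:

    use CGI;
    use Encode qw(decode);

    my $q = CGI->new;

    # Assumed, not real: CGI.pm does not extract this for us yet,
    # so pretend we pulled it from "...; charset=EUC-JP" ourselves.
    my $charset = 'euc-jp';

    # Convert each parameter from raw octets into perl's internal
    # form, so the rest of the script sees character strings.
    my %param;
    for my $name ($q->param) {
        $param{$name} = decode($charset, scalar $q->param($name));
    }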
It is up to Lincoln Stein to decide which way to go, but the demand to
keep CGI.pm compatible with older versions of Perl is so high that we
should not count on that. Alternatively, one could implement a more
modern version under a different namespace, as Lincoln himself admits.

> So the CGI scripter just has to work with perl's strings (encoded as
> perl sees fit), and then just "hint" (if necessary) to the CGI module
> how it should be encoded for transport back. I would expect the
> CGI.pm code to make sensible choices without hints in most cases -
> e.g. reply in the same encoding as the request was received in.

That "same encoding" is somewhat problematic, especially when Japanese
is involved. One problem that a later version of CGI.pm caused was
exactly that (I forget which version it was). Before the change, the
charset= part of Content-Type: was not sent, so it was up to the HTML
body to tell the browser which charset to use. Now charset="ISO-8859-1"
is appended by default, while users of CGI.pm keep sending Shift JIS,
EUC or ISO-2022-JP. (A workaround is sketched in the P.S. below.)

Actually, the charset situation in Japan has gotten even more
complicated since NTT DOCOMO introduced the (in)famous i-Mode. Not only
does i-Mode use Shift JIS (the most popular yet most problematic
charset used in Japan), they have also added their own extensions
(mostly dingbats that are used like icons). Oh well....

> But we cannot do this yet as Encode does not really support some key
> MIME charsets - notably the iso2022 family of escape encodings.
> I don't have the standards - they are paper copy things one buys for
> Yen (cannot find Yen sign on this keyboard) and may understandably be
> written in Japanese - which I cannot read, nor do I have any test
> data. (Other than the piles of assumed-Chinese SPAM that I seem to
> accumulate - but I don't know that is "valid".)

Right. We need more testers for that. Japanese charsets I know, but I
don't know much about the others.

> That is the ideal - well formed HTTP requests. We also need to handle
> legacy stuff and "guess" appropriately. But it seems to me that until
> we have a solution (with acceptable performance) to the well formed
> case, it is pointless to worry about the "guess" case.

I think the "guess" case is needed only for Japanese. The situation for
the other CJK languages is not this complicated; it is usually "legacy
+ UTF-8" (that is, GB2312 or UTF-8 for Simplified Chinese, for
instance).

> I will gladly re-word the sections you consider misleading and
> check and correct if necessary the ones you consider false.
> Can you give me a list as you spot them?

I will.

> So long as you check your facts as you go that is a welcome
> contribution. But do not for example suggest "you should always do
> _utf8_on() before calling encode()" because it isn't true.

No, I won't, but at the same time I still don't know when to and when
not to. I think we need more working examples before we settle on an
idiom (the P.P.S. below takes a first stab)....

Dan
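P.S. Regarding the charset="ISO-8859-1" default I mentioned above: the
immediate workaround is to name the charset explicitly when emitting
the header, since header() accepts a -charset parameter. A sketch,
assuming the body really is EUC-JP:

    use CGI;

    my $q = CGI->new;

    # Overrides the ISO-8859-1 default; sends
    #   Content-Type: text/html; charset=EUC-JP
    print $q->header(-type => 'text/html', -charset => 'EUC-JP');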
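P.P.S. A first stab at the idiom, for discussion: decode() from the
known charset on the way in, encode() to the target charset on the way
out, and leave _utf8_on() alone unless the octets are already known to
be valid UTF-8. The sample bytes below are assumed Shift JIS, and
Encode must have its Japanese encodings available:

    use Encode qw(decode encode);

    my $octets = "\x93\xFA\x96\x7B";          # "Nihon" in raw Shift JIS

    # Octets -> characters: decode FROM the charset they are in.
    my $string = decode('shiftjis', $octets);

    # Characters -> octets: encode TO whatever we must emit.
    my $out = encode('euc-jp', $string);

    # Encode::_utf8_on($octets) merely flips the flag without any
    # validation, so it is only safe on known-valid UTF-8 octets.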