Re: Charsets in HTTP (was: the CGI.pm in Perl 6)

Darren Duncan Sat, 16 Sep 2006 15:19:34 -0700

At 7:38 PM +0200 9/16/06, A. Pagaltzis wrote:

* Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:

 4.  Make UTF-8 the default HTTP response character encoding,
 and the default declared charset for text/* MIME types, and
 explicitly declare that this is what the charset is.  The only
 time that output should be anything else, even Latin-1, is if
 the programmer specifies such.


No, please don't. For unknown MIME types, the charset should be
undeclared. In particular, `application/octet-stream` should
never have a charset forced on it if one is not assigned by the
client code explicitly. Likewise, for `application/xml` and
`application/*+xml`, a charset should NEVER be explicitly
declared, as XML documents are self-describing, whereas declaring
a charset forces using the charset declared in the HTTP header.
This is very unwise (cf. Ruby's Postulate).

Look again; I was only specifying that a default charset is used for
text/* MIME types, not non-text/* MIME types; the latter would
typically have no charset as you say.
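
To make the distinction concrete, here is a minimal sketch of that rule, written in Python purely for illustration. The function name content_type_header and its signature are made up for this example and are not part of CGI.pm or any proposed Perl 6 API. It assumes the caller passes a bare MIME type with no parameters.

```python
def content_type_header(mime_type, charset=None):
    """Build a Content-Type value (hypothetical helper, illustration only).

    A default charset of UTF-8 is applied to text/* types only. Every
    other type gets a charset only when the caller supplies one.
    """
    mime_type = mime_type.strip().lower()
    if charset is None and mime_type.startswith("text/"):
        charset = "utf-8"  # the default proposed for text/* responses
    if charset is None:
        return mime_type   # non-text types stay undeclared
    return f"{mime_type}; charset={charset}"
```

With this, "text/html" gains an explicit UTF-8 declaration, while "application/xml" and "application/octet-stream" are left without a charset unless the programmer asks for one.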

 5.  Similarly, default to trying to treat the HTTP request as
 UTF-8 if it doesn't specify a character encoding; fallback to
 Latin-1 only if the text parts of the HTTP request don't look
 like valid UTF-8.


This is not just unwise, it is actually wrong. Latin-1 is the
default for `text/*` MIME types if no charset is declared. Using
a different charset in violation of the HTTP RFCs is __BROKEN__.

Okay, I retract that suggestion, because the official HTTP spec says
no-explicit-charset-means-Latin1.
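
The spec-conformant behaviour for requests can be sketched in a few lines. This is an illustration in Python, not anything taken from CGI.pm; the function name decode_text_body is hypothetical, and it assumes the raw body bytes and the Content-Type header value are already available. Header parsing is borrowed from the standard library's email.message.Message.

```python
from email.message import Message


def decode_text_body(raw, content_type):
    """Decode a request body to text (hypothetical helper, illustration only).

    Uses the charset declared in Content-Type when there is one, and
    falls back to Latin-1 (ISO-8859-1) for text/* bodies that declare
    none. Non-text bodies are returned as the original bytes.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_maintype() != "text":
        return raw  # not text: no charset is assumed
    # get_content_charset() returns None when no charset parameter is present
    charset = msg.get_content_charset() or "iso-8859-1"
    return raw.decode(charset)
```

Note that a declared charset Python does not know about would raise LookupError here; a real module would need to decide how to report that.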

In fact, now that I'm writing all this out, I am starting to
think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
does. Otherwise, the code and API would have to be able to deal
with the full complexity of charsets in HTTP, and the docs would
have to explain it, which is no picnic at all.

I disagree. Regardless of the details, a Perl 6 replacement for
CGI.pm *should* handle character set issues. Its users should simply
be able to pull out correctly interpreted, ready-to-use Str values
when the HTTP request content type is text, and not have to know
what character set was used in the request. Analogously, if
the user takes their Str values and supplies them to an HTTP response
whose content type is text, they should not have to specify an output
encoding if they don't want to, and UTF-8 is the best default because
it can handle all possible characters that the Str repertoire can
represent.

The CGI.pm replacement by no means has to do the dirty work of
processing encodings itself, such as mapping bytes to chars. Those
details would be handled by something else, such as either Perl 6
itself or a Perl 6 analogue of Encode.pm.


-- Darren Duncan
