At 7:38 PM +0200 9/16/06, A. Pagaltzis wrote:
* Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:
 4.  Make UTF-8 the default HTTP response character encoding,
 and the default declared charset for text/* MIME types, and
 explicitly declare that this is what the charset is.  The only
 time that output should be anything else, even Latin-1, is if
 the programmer specifies such.

No, please don't. For unknown MIME types, the charset should be
undeclared. In particular, `application/octet-stream` should
never have a charset forced on it if one is not assigned by the
client code explicitly. Likewise, for `application/xml` and
`application/*+xml`, a charset should NEVER be explicitly
declared, as XML documents are self-describing, whereas declaring
a charset forces using the charset declared in the HTTP header.
This is very unwise (cf. Ruby's Postulate).

Look again; I was only specifying a default charset for text/* MIME types, not for any other MIME types; the latter would typically have no charset, as you say.
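
To make the distinction concrete, here is a minimal sketch in Python (not Perl 6, and not any existing CGI API) of the header logic being described. The function name `content_type_header` is hypothetical; the only assumption is that text/* types default to UTF-8 and everything else gets no charset unless the caller supplies one.

```python
def content_type_header(mime_type, charset=None):
    """Build a Content-Type header value.

    Hypothetical helper: only text/* types receive a default charset;
    every other type is left undeclared unless the caller asks for one.
    """
    if charset is None and mime_type.lower().startswith("text/"):
        charset = "utf-8"  # assumed default, as proposed in this thread
    if charset is None:
        return mime_type
    return f"{mime_type}; charset={charset}"


print(content_type_header("text/html"))
print(content_type_header("application/xml"))
print(content_type_header("text/plain", "iso-8859-1"))
```

Under this sketch, `application/xml` and `application/octet-stream` pass through untouched, so an XML document's own encoding declaration stays authoritative.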

 > 5.  Similarly, default to trying to treat the HTTP request as
 UTF-8 if it doesn't specify a character encoding; fallback to
 Latin-1 only if the text parts of the HTTP request don't look
 like valid UTF-8.

This is not just unwise, it is actually wrong. Latin-1 is the
default for `text/*` MIME types if no charset is declared. Using
a different charset in violation of the HTTP RFCs is __BROKEN__.

Okay, I retract that suggestion, because the official HTTP spec says that no explicit charset means Latin-1.
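
For illustration, the spec-conforming behaviour could look roughly like the following Python sketch. The function name `decode_request_text` is made up for this example; it uses the standard library's `email.message.Message` only to parse the charset parameter out of the Content-Type value, and falls back to Latin-1 when none is declared.

```python
from email.message import Message


def decode_request_text(body, content_type):
    """Decode a text request body using the declared charset.

    Hypothetical helper: if the Content-Type value carries no charset
    parameter, fall back to ISO-8859-1 (Latin-1) instead of guessing.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or "iso-8859-1"
    return body.decode(charset)


# No charset declared: the bytes are read as Latin-1.
print(decode_request_text(b"caf\xe9", "text/plain"))
# Charset declared: the declaration wins.
print(decode_request_text("café".encode("utf-8"), "text/plain; charset=UTF-8"))
```

An unknown charset name raises `LookupError` here; what a real module should do in that case is left open.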

In fact, now that I'm writing all this out, I am starting to
think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
does. Otherwise, the code and API would have to be able to deal
with the full complexity of charsets in HTTP, and the docs would
have to explain it, which is no picnic at all.

I disagree. Regardless of the details, a Perl 6 replacement for CGI.pm *should* handle character set issues. Its users should simply be able to pull out correctly interpreted, ready-to-use Str values when the HTTP request content type is text, and not have to know what character set was used in the request. Analogously, if the user takes their Str values and supplies them to an HTTP response whose content type is text, they should not have to specify an output encoding if they don't want to, and UTF-8 is the best default because it can handle all possible characters that the Str repertoire can represent.

The CGI.pm replacement by no means has to do the dirty work of processing encodings itself, such as mapping bytes to chars. Those details would be handled by something else, such as either Perl 6 itself or a Perl 6 analogue of Encode.pm.
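
As a rough sketch of that division of labour, written in Python rather than Perl 6: the hypothetical `encode_text_response` below only chooses the charset and declares it, while the actual mapping of characters to bytes is delegated to the language's built-in encoder (`str.encode`, standing in for an Encode.pm analogue).

```python
def encode_text_response(text, mime_type="text/plain", charset="utf-8"):
    """Turn a string into response headers plus an encoded body.

    Hypothetical helper: UTF-8 is the default because it can represent
    any character in the string; the caller may override the charset.
    """
    body = text.encode(charset)  # the encoding layer does the byte mapping
    headers = {
        "Content-Type": f"{mime_type}; charset={charset}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = encode_text_response("na\u00efve \u2603")
print(headers["Content-Type"])
print(len(body))
```

If the caller overrides the charset with one that cannot represent the text (Latin-1 and the snowman above, for instance), `str.encode` raises `UnicodeEncodeError`, which is one reason UTF-8 is the safer default.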

-- Darren Duncan
