At 7:38 PM +0200 9/16/06, A. Pagaltzis wrote:
> * Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:
> > 4. Make UTF-8 the default HTTP response character encoding,
> > and the default declared charset for text/* MIME types, and
> > explicitly declare that this is what the charset is. The only
> > time that output should be anything else, even Latin-1, is if
> > the programmer specifies such.
> No, please don't. For unknown MIME types, the charset should be
> undeclared. In particular, `application/octet-stream` should
> never have a charset forced on it if one is not assigned by the
> client code explicitly. Likewise, for `application/xml` and
> `application/*+xml`, a charset should NEVER be explicitly
> declared, as XML documents are self-describing, whereas declaring
> a charset forces using the charset declared in the HTTP header.
> This is very unwise (cf. Ruby's Postulate).
Look again; I was only specifying that a default charset is used for
text/* MIME types, not non-text/* MIME types; the latter would
typically have no charset as you say.
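To make the rule in item 4 concrete, here is a minimal sketch in Python (not Perl 6, and not a proposed CGI.pm6 API). The function name `content_type_header` and its parameters are hypothetical; the sketch only illustrates the defaulting policy being discussed: UTF-8 is declared for text/* unless the programmer picks something else, and no charset is attached to any other type unless it is explicitly requested.

```python
from typing import Optional


def content_type_header(mime_type: str, charset: Optional[str] = None) -> str:
    """Build a Content-Type header value (hypothetical helper, for illustration)."""
    # An explicit choice by the programmer always wins, for any MIME type.
    if charset is not None:
        return f"{mime_type}; charset={charset}"
    # Default: declare UTF-8, but only for text/* types.
    if mime_type.lower().startswith("text/"):
        return f"{mime_type}; charset=UTF-8"
    # Everything else (application/octet-stream, application/xml, ...)
    # is left without a charset parameter.
    return mime_type
```

Under these assumptions, `content_type_header("application/xml")` leaves the type untouched, so an XML document's own encoding declaration stays authoritative.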
> > 5. Similarly, default to trying to treat the HTTP request as
> > UTF-8 if it doesn't specify a character encoding; fallback to
> > Latin-1 only if the text parts of the HTTP request don't look
> > like valid UTF-8.
> This is not just unwise, it is actually wrong. Latin-1 is the
> default for `text/*` MIME types if no charset is declared. Using
> a different charset in violation of the HTTP RFCs is __BROKEN__.
Okay, I retract that suggestion, because the official HTTP spec says
that no explicit charset means Latin-1.
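A small Python sketch (again not Perl 6; both function names are hypothetical) shows how the two rules diverge. The first function follows the rule as stated in this thread: honor a declared charset, otherwise decode as Latin-1. The second is the retracted item 5 heuristic: try UTF-8 first, then fall back to Latin-1.

```python
from typing import Optional


def decode_per_spec(body: bytes, declared_charset: Optional[str] = None) -> str:
    """Honor the declared charset; with none declared, assume Latin-1."""
    return body.decode(declared_charset or "iso-8859-1")


def decode_utf8_first(body: bytes) -> str:
    """The retracted heuristic: guess UTF-8, fall back to Latin-1."""
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return body.decode("iso-8859-1")
```

For the bytes `b"caf\xc3\xa9"` with no declared charset, the first function returns `cafÃ©` and the second returns `café`, so a sender that really meant Latin-1 would be misread by the heuristic. For `b"caf\xe9"`, which is not valid UTF-8, both return `café`.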
> In fact, now that I'm writing all this out, I am starting to
> think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
> does. Otherwise, the code and API would have to be able to deal
> with the full complexity of charsets in HTTP, and the docs would
> have to explain it, which is no picnic at all.
I disagree. Regardless of the details, a Perl 6 replacement for
CGI.pm *should* handle character set issues. Its users should simply
be able to pull out correctly interpreted ready-to-use Str values
when the HTTP request content type is text, and not have to know
about what character set was used in the request. Analogously, if
the user takes their Str values and supplies them to an HTTP response
whose content type is text, they should not have to specify an output
encoding if they don't want to, and UTF-8 is the best default because
it can handle all possible characters that the Str repertoire can
represent.
The CGI.pm replacement by no means has to do the dirty work of
processing encodings itself, such as mapping bytes to chars etc.
Those details would be handled by something else, such as either Perl
6 itself or a Perl 6 analogue to Encode.pm.
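As a rough illustration of that division of labour, here is a Python sketch in which the hypothetical `encode_text_response` plays the part of the CGI layer and Python's built-in codecs stand in for an Encode.pm-like library. The CGI layer only picks a default and labels the output; the character-to-byte mapping is delegated.

```python
from typing import Tuple


def encode_text_response(text: str,
                         mime_type: str = "text/plain",
                         charset: str = "utf-8") -> Tuple[str, bytes]:
    """Return a (header line, body bytes) pair for a text response."""
    # The codec library does the actual mapping of characters to bytes.
    body = text.encode(charset)
    # The CGI layer's only job is to declare what it used.
    header = f"Content-Type: {mime_type}; charset={charset}"
    return header, body
```

With the default, any string encodes successfully. Asking for `charset="iso-8859-1"` on text containing a euro sign raises `UnicodeEncodeError`, which is the practical argument for UTF-8 as the default output encoding.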
-- Darren Duncan