A. Pagaltzis wrote 2006-09-16 19:38 (+0200):
> * Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:
> > 4. Make UTF-8 the default HTTP response character encoding, and the
> > default declared charset for text/* MIME types, and explicitly
> > declare that this is what the charset is. The only time that output
> > should be anything else, even Latin-1, is if the programmer
> > specifies such.
> No, please don't. For unknown MIME types, the charset should be
> undeclared. In particular, `application/octet-stream` should never
> have a charset forced on it if one is not assigned by the client code
> explicitly. Likewise, for `application/xml` and `application/*+xml`, a
> charset should NEVER be explicitly declared, as XML documents are
> self-describing, whereas declaring a charset forces using the charset
> declared in the HTTP header. This is very unwise (cf. Ruby's
> Postulate).
Darren discussed the *default* encoding. Like how text/html is a nice
default for the MIME-type, UTF-8 is a nice encoding. Both should be
overridable.
My thoughts:
* Default Content-Type header of "text/html; charset=UTF-8".
* Default output encoding of UTF-8.
* When a new Content-Type is set, but no new encoding
    * Keep the default output encoding of UTF-8
    * Warn if it's text/* without /charset=/
    * Use the specified charset as the output encoding
    * Change the output encoding to raw bytes if it's not text/*
* When a new Content-Type is set, and a new encoding is given
    * Use the supplied encoding
    * Warn if it's text/* without /charset=/
    * Warn if supplied encoding and charset aren't equal enough
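
To make the rules above concrete, here is a rough sketch in Python rather than Perl 6, since the proposed module does not exist yet. All names (`charset_of`, `same_codec`, `resolve`) are hypothetical. It assumes the sub-items under "no new encoding" are alternatives (a declared charset wins, otherwise text/* keeps UTF-8 with a warning, otherwise raw bytes), and it assumes "equal enough" can mean "both names are aliases of the same codec", which it checks with Python's `codecs.lookup`.

```python
import codecs
import re

DEFAULT_TYPE = "text/html; charset=UTF-8"
DEFAULT_ENCODING = "UTF-8"


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r';\s*charset\s*=\s*"?([^\s;"]+)"?', content_type,
                      re.IGNORECASE)
    return match.group(1) if match else None


def same_codec(a, b):
    """Assumption: 'equal enough' means both names map to the same codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Unknown to Python: fall back to a case-insensitive comparison.
        return a.lower() == b.lower()


def resolve(content_type, encoding=None):
    """Return (output_encoding, warnings); None stands for raw bytes."""
    warnings = []
    charset = charset_of(content_type)
    is_text = content_type.strip().lower().startswith("text/")

    if is_text and charset is None:
        warnings.append("text/* type without charset")

    if encoding is not None:
        # An explicitly supplied encoding always wins.
        if charset is not None and not same_codec(encoding, charset):
            warnings.append("encoding and charset differ")
        return encoding, warnings

    if charset is not None:
        # No explicit encoding: take it from the Content-Type string.
        return charset, warnings

    # No charset anywhere: UTF-8 for text/*, raw bytes otherwise.
    return (DEFAULT_ENCODING if is_text else None), warnings
```

With this sketch, `resolve("application/xml")` yields raw bytes and no warning, while `resolve("text/plain")` keeps UTF-8 but warns about the missing charset.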
I think it's important to realise that only text/* have charset, and
that Content-Type is MIME-type plus charset in one value. We shouldn't
be "clever" and separate these: they're one string.
For XML, you'd have to explicitly mention Content-Type and encoding,
because the encoding can no longer be taken from the Content-Type, and
the default for non-text/* is raw bytes.
> > 5. Similarly, default to trying to treat the HTTP request as
> > UTF-8 if it doesn't specify a character encoding; fallback to
> > Latin-1 only if the text parts of the HTTP request don't look
> > like valid UTF-8.
> This is not just unwise, it is actually wrong. Latin-1 is the
> default for `text/*` MIME types if no charset is declared. Using
> a different charset in violation of the HTTP RFCs is __BROKEN__.
Agreed.
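
For the request side, that amounts to something like the following Python sketch. The helper name `decode_request_text` is hypothetical. It honours a declared charset and otherwise falls back to ISO-8859-1, the default that RFC 2616 (section 3.7.1) gives for text/* received via HTTP, without trying to guess UTF-8.

```python
import re

# Default charset for text/* over HTTP when none is declared (RFC 2616, 3.7.1).
HTTP_TEXT_DEFAULT = "iso-8859-1"


def decode_request_text(body, content_type):
    """Decode a text/* request body using the declared charset, if any."""
    match = re.search(r';\s*charset\s*=\s*"?([^\s;"]+)"?', content_type,
                      re.IGNORECASE)
    charset = match.group(1) if match else HTTP_TEXT_DEFAULT
    return body.decode(charset)
```

Note that UTF-8 bytes sent without a charset come out as mojibake here. That is the price of following the RFC instead of sniffing.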
> In fact, now that I'm writing all this out, I am starting to
> think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
> does. Otherwise, the code and API would have to be able to deal
> with the full complexity of charsets in HTTP, and the docs would
> have to explain it, which is no picnic at all.
Simple schemes can always be documented equally simply.
A first attempt:
The default value for the C<Content-Type> header is C<text/html;
charset=UTF-8>.

The encoding that $module uses for output data is taken from the
C<charset> attribute in the C<Content-Type> header. If there is no
charset in the C<Content-Type> header, UTF-8 is used for all text/*
types, and raw for everything else.

It is possible to explicitly force an output encoding. When you're
not sending a text/* document, you need to do this if the document
does contain text. This is the case with most XML formats.
$response1.type = 'text/html; charset=iso-8859-1';
# implies: $response1.encoding = 'iso-8859-1';
$response2.type = 'application/xml';
$response2.encoding = 'UTF-8';
my $response3 = Web::Response.new :type('text/html; charset=iso-8859-1');
my $response4 = Web::Response.new :type<application/xml>,
                                  :encoding<UTF-8>;
--
Kind regards,
juerd waalboer: perl hacker <[EMAIL PROTECTED]> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <[EMAIL PROTECTED]>