Re: Charsets in HTTP (was: the CGI.pm in Perl 6)

Juerd Sat, 16 Sep 2006 13:10:34 -0700

A. Pagaltzis wrote 2006-09-16 19:38 (+0200):
> * Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:
> > 4.  Make UTF-8 the default HTTP response character encoding, and the
> > default declared charset for text/* MIME types, and explicitly
> > declare that this is what the charset is.  The only time that output
> > should be anything else, even Latin-1, is if the programmer
> > specifies such.
> No, please don't. For unknown MIME types, the charset should be
> undeclared. In particular, `application/octet-stream` should never
> have a charset forced on it if one is not assigned by the client code
> explicitly. Likewise, for `application/xml` and `application/*+xml`, a
> charset should NEVER be explicitly declared, as XML documents are
> self-describing, whereas declaring a charset forces using the charset
> declared in the HTTP header.  This is very unwise (cf. Ruby's
> Postulate).


Darren discussed the *default* encoding. Just as text/html is a nice
default for the MIME type, UTF-8 is a nice default for the encoding.
Both should be overridable.

My thoughts:

    * Default Content-Type header of "text/html; charset=UTF-8".
    * Default output encoding of UTF-8.
    * When a new Content-Type is set, but no new encoding is given
        * If it has a charset, use that charset as the output encoding
        * If it's text/* without /charset=/, warn and keep the default
          output encoding of UTF-8
        * If it's not text/* and has no charset, change the output
          encoding to raw bytes
    * When a new Content-Type is set, and a new encoding is given
        * Use the supplied encoding
        * Warn if it's text/* without /charset=/
        * Warn if supplied encoding and charset aren't equal enough
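
The rules above can be sketched as a small decision function. This is
only an illustration in Python, not a proposed API: the names
resolve_output_encoding, charset_of and same_encoding are hypothetical,
"raw bytes" is represented as None, and "equal enough" is assumed to
mean "both names resolve to the same codec".

```python
import codecs
import re
import warnings

DEFAULT_TYPE = "text/html; charset=UTF-8"

# Matches the charset parameter inside a Content-Type value.
_CHARSET_RE = re.compile(r';\s*charset\s*=\s*"?([^";\s]+)"?', re.IGNORECASE)


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    match = _CHARSET_RE.search(content_type)
    return match.group(1) if match else None


def same_encoding(a, b):
    """True if two names refer to the same codec (e.g. utf8 vs UTF-8)."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return a.lower() == b.lower()


def resolve_output_encoding(content_type=DEFAULT_TYPE, encoding=None):
    """Return the output encoding to use, or None for raw bytes."""
    is_text = content_type.strip().lower().startswith("text/")
    charset = charset_of(content_type)

    if is_text and charset is None:
        warnings.warn("text/* Content-Type without charset")

    if encoding is not None:
        # An explicitly supplied encoding wins, but a mismatch with the
        # declared charset is worth a warning.
        if charset is not None and not same_encoding(encoding, charset):
            warnings.warn("encoding %r does not match charset %r"
                          % (encoding, charset))
        return encoding

    if charset is not None:
        return charset
    # No charset and no explicit encoding: UTF-8 for text/*, raw otherwise.
    return "UTF-8" if is_text else None
```

With this sketch, resolve_output_encoding("application/xml") yields None
(raw bytes), so an XML response needs the encoding passed explicitly.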

I think it's important to realise that only text/* types have a charset,
and that Content-Type is MIME type plus charset in one value. We
shouldn't be "clever" and separate these: they're one string.

For XML, you'd have to explicitly mention Content-Type and encoding,
because the encoding can no longer be taken from the Content-Type, and
the default for non-text/* is raw bytes.

> > 5.  Similarly, default to trying to treat the HTTP request as
> > UTF-8 if it doesn't specify a character encoding; fallback to
> > Latin-1 only if the text parts of the HTTP request don't look
> > like valid UTF-8.
> This is not just unwise, it is actually wrong. Latin-1 is the
> default for `text/*` MIME types if no charset is declared. Using
> a different charset in violation of the HTTP RFCs is __BROKEN__.

Agreed.
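
For the request side, that rule (use the declared charset, and fall
back to ISO-8859-1 for text/* when none is declared, without guessing
from the bytes) might look like this. Again a Python sketch for
illustration only; decode_request_text is a hypothetical helper, not
part of CGI.pm or any proposed module.

```python
import re

# Default charset for text/* when the header declares none, as the
# HTTP RFCs discussed in this thread specify.
HTTP_DEFAULT_TEXT_CHARSET = "iso-8859-1"

_CHARSET_RE = re.compile(r';\s*charset\s*=\s*"?([^";\s]+)"?', re.IGNORECASE)


def decode_request_text(body, content_type):
    """Decode a text/* request body using only what the header declares."""
    match = _CHARSET_RE.search(content_type)
    charset = match.group(1) if match else HTTP_DEFAULT_TEXT_CHARSET
    return body.decode(charset)
```

Note that bytes which happen to be valid UTF-8 are still decoded as
Latin-1 when no charset is declared; the sketch deliberately does no
sniffing.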

> In fact, now that I'm writing all this out, I am starting to
> think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
> does. Otherwise, the code and API would have to be able to deal
> with the full complexity of charsets in HTTP, and the docs would
> have to explain it, which is no picnic at all.

Simple schemes can always be documented equally simply.

A first attempt:

    The default value for the C<Content-Type> header is C<text/html;
    charset=UTF-8>.

    The encoding that $module uses for output data is taken from the
    C<charset> attribute in the C<Content-Type> header. If there is no
    charset in the C<Content-Type> header, UTF-8 is used for all text/*
    types, and raw for everything else.

    It is possible to explicitly force an output encoding. When you're
    not sending a text/* document, you need to do this if the document
    does contain text. This is the case with most XML formats.
        
        $response1.type = 'text/html; charset=iso-8859-1';
        # implies: $response1.encoding = 'iso-8859-1';

        $response2.type = 'application/xml';
        $response2.encoding = 'UTF-8';
        
        my $response3 = Web::Response.new :type('text/html; charset=iso-8859-1');
        my $response4 = Web::Response.new :type<application/xml>, :encoding<UTF-8>;
-- 
korajn salutojn,

  juerd waalboer:  perl hacker  <[EMAIL PROTECTED]>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <[EMAIL PROTECTED]>
