Hi

I think the following PHP 5.4.0 NEWS entry is misleading.

  . Changed default value of "default_charset" php.ini option from ISO-8859-1 to
    UTF-8. (Rasmus)

I understood this to mean that default_charset is now UTF-8, so I expected
the following HTTP header:

content-type    text/html; charset=UTF-8

However, the header had no charset parameter at all ('charset=UTF-8' was missing).
So I looked at the source and found this line in SAPI.h (line 293):

#define SAPI_DEFAULT_CHARSET        ""

Shouldn't the empty string be "UTF-8"?
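
To illustrate why an empty default drops the parameter from the header, here
is a minimal sketch. build_content_type() is a hypothetical helper I made up
for this mail, not the actual SAPI code; it only assumes that the charset
parameter is appended when the configured charset is non-empty.

```c
#include <stdio.h>

/* Hypothetical helper (not PHP's real SAPI code): builds a Content-Type
 * value and appends "; charset=..." only when charset is non-empty. */
static int build_content_type(char *buf, size_t size,
                              const char *mimetype, const char *charset)
{
    if (charset != NULL && *charset != '\0') {
        return snprintf(buf, size, "%s; charset=%s", mimetype, charset);
    }
    /* Empty charset: the browser is left to guess the encoding. */
    return snprintf(buf, size, "%s", mimetype);
}
```

With charset set to "" the result is just the MIME type, which matches the
header I observed; with "UTF-8" the charset parameter is included.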

BTW, a missing charset in the HTTP header does not mean the browser will
default to ISO-8859-1; it lets the browser guess which encoding is used.
Encoding guessing can enable XSS under certain conditions.


Anyway, I was curious, so I checked ext/standard/html.c and found:

/* {{{ entity_charset determine_charset
 * returns the charset identifier based on current locale or a hint.
 * defaults to UTF-8 */
static enum entity_charset determine_charset(char *charset_hint TSRMLS_DC)
{
        int i;
        enum entity_charset charset = cs_utf_8;
        int len = 0;
        const zend_encoding *zenc;

        /* Default is now UTF-8 */
        if (charset_hint == NULL)
                return cs_utf_8;


There are two problems:

 - The built-in default for php.ini's default_charset should be UTF-8,
   but it is empty.
 - determine_charset() should not blindly default to UTF-8 when there
   is no hint.

The old htmlentities()/htmlspecialchars() actually determined the charset from
default_charset, mbstring.internal_encoding, etc. I think the old behavior
is better than the current one.

How about making determine_charset() behave as it did in 5.3 and setting
SAPI_DEFAULT_CHARSET to "UTF-8"?

Then PHP would behave as the NEWS entry says: the default encoding of
htmlentities()/htmlspecialchars() would be UTF-8, and users would still
have control over that default.
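
A rough sketch of the fallback I have in mind is below. pick_charset() and
its parameters are hypothetical names for illustration only, and the lookup
order (explicit hint, then default_charset, then the mbstring internal
encoding, then UTF-8) is my assumption; the exact order in 5.3 may differ.

```c
#include <stddef.h>

/* Returns non-zero when s is a usable (non-NULL, non-empty) charset name. */
static int is_set(const char *s)
{
    return s != NULL && *s != '\0';
}

/* Hypothetical sketch, not the real determine_charset(): choose a charset
 * name from the available settings, falling back to UTF-8 last. */
static const char *pick_charset(const char *hint,
                                const char *default_charset,
                                const char *mb_internal_encoding)
{
    if (is_set(hint))
        return hint;                  /* explicit argument wins */
    if (is_set(default_charset))
        return default_charset;       /* php.ini default_charset */
    if (is_set(mb_internal_encoding))
        return mb_internal_encoding;  /* mbstring.internal_encoding */
    return "UTF-8";                   /* nothing configured */
}
```

With this shape, a site that sets nothing gets UTF-8, while a site that sets
default_charset changes the htmlentities()/htmlspecialchars() default without
touching every call.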

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php