>> strlen("\xC4\x85") = 2. strlen((binary)"\xC4\x85") = 4. Not good. It is
>> one character in utf-8.
>
> I'm afraid I don't understand you again..

0xC4 0x85 is the UTF-8 byte sequence for LATIN SMALL LETTER A WITH OGONEK
(U+0105): ą

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If the script is saved as UTF-8, I expect bool(true) from the var_dump()
line. With unicode.semantics turned on, it is bool(false). SquirrelMail's
internal character set decoding functions write their mapping tables as
hexadecimal or octal escapes. In some cases they evaluate only a byte value
and not the whole symbol. Multibyte character set decoding can use recode,
iconv or mbstring, but most of the single-byte decoding is written with
plain string functions and stores hex-to-HTML mapping tables in
associative arrays.
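
To illustrate the single-byte case described above, here is a minimal
sketch of a byte-keyed mapping table. The function name and the table
excerpt are hypothetical and are not SquirrelMail's actual code; the sketch
assumes that "\xB1" denotes exactly one raw byte, which is how binary
strings behave.

```php
<?php
// Hypothetical helper, not SquirrelMail's actual function: decodes a few
// ISO-8859-2 bytes to HTML numeric entities using a byte-keyed table.
function decode_iso_8859_2_sketch($bytes)
{
    // Keys are single raw bytes written as hex escapes. Only a small
    // excerpt of the 0xA1-0xFF range is listed here.
    $map = array(
        "\xA1" => '&#260;', // LATIN CAPITAL LETTER A WITH OGONEK
        "\xB1" => '&#261;', // LATIN SMALL LETTER A WITH OGONEK
        "\xA3" => '&#321;', // LATIN CAPITAL LETTER L WITH STROKE
        "\xB3" => '&#322;', // LATIN SMALL LETTER L WITH STROKE
        "\xE6" => '&#263;', // LATIN SMALL LETTER C WITH ACUTE
        "\xEA" => '&#281;', // LATIN SMALL LETTER E WITH OGONEK
    );
    // strtr() replaces every mapped byte; unmapped bytes pass through.
    return strtr($bytes, $map);
}

echo decode_iso_8859_2_sketch("m\xB1ka"), "\n"; // prints m&#261;ka
```

If the engine stops treating "\xB1" as a single byte, the lookup no longer
matches the input and the text passes through undecoded.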

<?php
// The example uses UTF-8. Similar code is used for ISO-8859-2 through
// ISO-8859-16 decoding. UTF-8 decoding does not need mapping tables
// and is written with PCRE.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą
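
The comment in the example above notes that UTF-8 decoding is done with
PCRE instead of mapping tables. A minimal sketch of that idea follows. It is
an assumption about the general approach, not SquirrelMail's code: the
function name is hypothetical and only two-byte sequences (U+0080 to
U+07FF) are handled.

```php
<?php
// Hypothetical helper: converts two-byte UTF-8 sequences to HTML numeric
// entities with a byte-oriented regular expression (no /u modifier).
function utf8_two_byte_to_entities_sketch($bytes)
{
    return preg_replace_callback(
        // Lead byte 0xC2-0xDF followed by one continuation byte 0x80-0xBF.
        '/([\xC2-\xDF])([\x80-\xBF])/',
        function ($m) {
            // Low 5 bits of the lead byte, then low 6 bits of the
            // continuation byte, form the code point.
            $cp = ((ord($m[1]) & 0x1F) << 6) | (ord($m[2]) & 0x3F);
            return '&#' . $cp . ';';
        },
        $bytes
    );
}

echo utf8_two_byte_to_entities_sketch("\xC4\x85"), "\n"; // prints &#261;
```

Like the mapping-table code, this depends on the pattern and the subject
both being handled as raw bytes.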

The test setup (php6.0-200705190630) uses a trimmed php.ini containing
only the unicode.semantics=on setting:

unicode.fallback_encoding - no value
unicode.filesystem_encoding - no value
unicode.http_input_encoding - no value
unicode.output_encoding - no value
unicode.runtime_encoding - no value
unicode.script_encoding - no value
unicode.semantics - On
unicode.stream_encoding - UTF-8

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
