>> strlen("\xC4\x85") = 2. strlen((binary)"\xC4\x85") = 4. Not good. It is
>> one character in utf-8.
>
> I'm afraid I don't understand you again..

0xC4 and 0x85 are the hex codes for latin small letter a with ogonek (ą) in utf-8.

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If the script is written in utf-8, I expect bool(true) on the var_dump() line. It is bool(false) when unicode.semantics is turned on.

Internal SquirrelMail character set decoding functions write their mapping tables in hexadecimal or octal escapes. In some cases they evaluate only the byte value and not the whole symbol. Multibyte character set decoding can use recode, iconv and mbstring, but most of the single-byte decoding is written with plain string functions and stores hex to html mapping tables in associative arrays.

<?php
// example uses utf-8. similar code is used in iso-8859-2 -
// iso-8859-16 decoding. utf-8 decoding does not need mapping tables
// and is written in pcre.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą

The test setup (php6.0-200705190630) uses a trimmed php.ini with only the unicode.semantics=on setting:

unicode.fallback_encoding - no value
unicode.filesystem_encoding - no value
unicode.http_input_encoding - no value
unicode.output_encoding - no value
unicode.runtime_encoding - no value
unicode.script_encoding - no value
unicode.semantics - On
unicode.stream_encoding - UTF-8
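
For readers who have not seen the single-byte decoding style mentioned above, here is a minimal sketch of a hex to html mapping table stored in an associative array. The table name $iso8859_2_map and the function decode_with_table() are hypothetical names used only for illustration; this is not SquirrelMail's actual code, and the table holds just two iso-8859-2 entries. It assumes ordinary byte-string semantics, where "\xB1" is a single byte.

```php
<?php
// Hypothetical two-entry excerpt of an iso-8859-2 to html mapping table.
// Keys are raw bytes written as hex escapes, values are html entities.
$iso8859_2_map = array(
    "\xA1" => '&#260;', // latin capital letter a with ogonek
    "\xB1" => '&#261;', // latin small letter a with ogonek
);

// Replace every mapped byte with its html entity using plain string functions.
function decode_with_table($string, $map) {
    return str_replace(array_keys($map), array_values($map), $string);
}

// With byte strings this prints "&#261;&#260;".
echo decode_with_table("\xB1\xA1", $iso8859_2_map), "\n";
```

Code of this shape depends on "\xB1" meaning one byte with value 0xB1. If the escape is interpreted as anything other than that byte, the keys no longer match the input and the lookup silently does nothing, which is the same kind of failure as in the str_replace() example above.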