>> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
>> utf-8. ą
>>
>> <?php
>> var_dump("ą" == "\xC4\x85");
>> echo "ą\n";
>> echo "\xC4\x85";
>> ?>
>>
>> If script is written in utf-8, I expect bool(true) on var_dump() line.
>
> var_dump("ą" == b"\xC4\x85");
>
> This will give you what you want, if the script is written in UTF-8
> and your runtime encoding is set to UTF-8.
>
>> <?php
>> // example uses utf-8. similar code is used in iso-8859-2 -
>> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
>> // and is written in pcre.
>> $s1 = "ą";
>> $s2 = "\xC4\x85";
>> echo str_replace($s2,'ą',$s1);
>> ?>
>>
>> Expected result: ą
>> Got: ą
>>
>> test setup (php6.0-200705190630) uses trimmed php.ini with only
>> unicode.semantics=on setting
>>
>> unicode.fallback_encoding - no value
>> unicode.filesystem_encoding - no value
>> unicode.http_input_encoding - no value
>> unicode.output_encoding - no value
>> unicode.runtime_encoding - no value
>> unicode.script_encoding - no value
>> unicode.semantics - On
>> unicode.stream_encoding - UTF-8
>
> Why didn't you set any encoding settings?

They are not documented, and I am testing configurations that might
break scripts. When I test things and want the code to be portable, I
cannot assume that the configuration is rational. I can set an option
with ini_set() if I understand what the option does and it fixes the
issue.

http://www.php.net/unicode

Do you have an updated version of the documentation that explains the
encoding settings and lists the available configuration values? Or am I
testing PHP6 too early, and you are still months or years away from
6.0.0 betas and RCs?

Could you implement a pseudo-encoding similar to the 'pass' encoding
used in mbstring? The current implementation does not give script
writers the controls they need.

SquirrelMail scripts are not written in Unicode. They are in ASCII. If
an 8-bit value is used, it is always written in octal or hex notation.
These hex values are not all written in one character set. In some
cases scripts use raw byte values, for example to locate the first
UTF-8 byte or to look for 0x80-0xFF bytes in a string. In other cases
they are written in the source or target character set. For example,
the iso-8859-2 decoding function contains an array of iso-8859-2 hex
values mapped to HTML codes.

The code can't use raw 8-bit strings, because they might be corrupted
by a misconfigured editor used by a developer, and such corruption is
very hard to track down. 8-bit data can come only from user input
(composed emails and preferences, HTML forms, one common charset) and
from the IMAP server (received emails, lots of different charsets and
encodings).

-- 
Tomas

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
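
The ASCII-only coding style described in the message above (8-bit
values written only as hex escapes, a byte-range check, and a table
mapping iso-8859-2 bytes to HTML codes) could look roughly like the
sketch below. This is not SquirrelMail's actual code: the function
names has_8bit_bytes() and decode_iso8859_2_sketch() are hypothetical,
the table is a four-entry excerpt, and the sketch assumes byte (binary)
string semantics as in PHP 5. How such code behaves with
unicode.semantics=on is exactly the open question in this thread.

```php
<?php
// Hypothetical helpers for illustration only; assumes strings are byte
// strings (PHP 5 style), not PHP 6 unicode strings.

// Returns true if the string contains at least one byte in 0x80-0xFF.
function has_8bit_bytes($string) {
    return preg_match('/[\x80-\xFF]/', $string) === 1;
}

// Maps iso-8859-2 bytes to HTML numeric entities.
function decode_iso8859_2_sketch($string) {
    // Keys are hex escapes, so the source file itself stays pure ASCII
    // and cannot be corrupted by an editor using the wrong charset.
    // Small excerpt of the table, not the full iso-8859-2 mapping.
    $map = array(
        "\xA1" => '&#260;', // LATIN CAPITAL LETTER A WITH OGONEK
        "\xB1" => '&#261;', // LATIN SMALL LETTER A WITH OGONEK
        "\xA3" => '&#321;', // LATIN CAPITAL LETTER L WITH STROKE
        "\xB3" => '&#322;', // LATIN SMALL LETTER L WITH STROKE
    );
    if (!has_8bit_bytes($string)) {
        return $string; // plain ASCII, nothing to decode
    }
    // strtr() with an array replaces each matching byte sequence.
    return strtr($string, $map);
}

// 0xB1 is "a with ogonek" in iso-8859-2.
echo decode_iso8859_2_sketch("za\xB1b"), "\n";
```

The lookup works only if "\xB1" in the script and the 0xB1 byte in the
input are compared as raw bytes, which is why a 'pass'-like encoding
(or an explicit binary string type) matters for this kind of code.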