>> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
>> utf-8. ą
>>
>> <?php
>> var_dump("ą" == "\xC4\x85");
>> echo "ą\n";
>> echo "\xC4\x85";
>> ?>
>>
>> If script is written in utf-8, I expect bool(true) on var_dump() line.
>
> var_dump("ą" == b"\xC4\x85");
>
> This will give you what you want, if the script is written in UTF-8
> and your runtime encoding is set to UTF-8.
>
>> <?php
>> // example uses utf-8. similar code is used in iso-8859-2 -
>> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
>> // and is written in pcre.
>> $s1 = "ą";
>> $s2 = "\xC4\x85";
>> echo str_replace($s2,'&#261;',$s1);
>> ?>
>>
>> Expected result: &#261;
>> Got: ą
>>
>> test setup (php6.0-200705190630) uses trimmed php.ini with only
>> unicode.semantics=on setting
>>
>> unicode.fallback_encoding - no value
>> unicode.filesystem_encoding - no value
>> unicode.http_input_encoding - no value
>> unicode.output_encoding - no value
>> unicode.runtime_encoding - no value
>> unicode.script_encoding - no value
>> unicode.semantics - On
>> unicode.stream_encoding - UTF-8
>
> Why didn't you set any encoding settings?

They are not documented, and I am deliberately testing configurations that
might break scripts. If I want the code to be portable, I cannot assume
that the configuration is sensible. I can set an option with ini_set() once
I understand what the option does and that it fixes the issue.
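
A minimal sketch of that kind of probing, assuming a hypothetical helper
name (probe_ini_option) and using the option name from the list quoted
above. Whether the unicode.* options can be changed at runtime is not
documented, so the helper only reports what happens:

```php
<?php
// Hypothetical helper, not part of SquirrelMail or PHP itself.
// Tries to set an ini option and reports the outcome as a string.
function probe_ini_option($name, $value) {
    // ini_get() returns false when the option is unknown to this build.
    if (ini_get($name) === false) {
        return 'unavailable';
    }
    // ini_set() returns false when the value could not be changed.
    if (ini_set($name, $value) === false) {
        return 'not changeable';
    }
    return 'set';
}

// Assumption: the option exists only in PHP 6 development snapshots.
echo probe_ini_option('unicode.runtime_encoding', 'UTF-8'), "\n";
```

On a build without the unicode.* options this reports 'unavailable', so a
portable script can fall back to its own byte handling.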

http://www.php.net/unicode

Do you have an updated version of the documentation that explains the
encoding settings and lists the available configuration values? Or am I
testing PHP6 too early, and you are still months or years away from 6.0.0
betas and RCs? Could you implement a pseudo-encoding similar to the 'pass'
encoding used in mbstring? The current implementation does not give script
writers the controls they need.

SquirrelMail scripts are not written in Unicode. They are in ASCII. If an
8-bit value is used, it is always written in octal or hex notation. These
hex values do not all belong to one character set. In some cases the
scripts use raw byte values, for example when locating the first UTF-8
byte or looking for 0x80-0xFF bytes in a string. In other cases the values
are written in the source or target character set. For example, the
iso-8859-2 decoding function contains an array that maps iso-8859-2 hex
values to HTML codes. The code can't use raw 8-bit strings, because they
might be corrupted by a misconfigured editor used by a developer, and such
corruption is very hard to track down. 8-bit data can come only from user
input (composed emails and preferences, HTML forms, one common charset)
and from the IMAP server (received emails, lots of different charsets and
encodings).
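
To illustrate the three uses of hex escapes described above, here is a
simplified sketch. The function names are hypothetical and the table is
only a four-entry excerpt, so this is not the actual SquirrelMail code. It
assumes strings are treated as plain byte strings (no unicode.semantics):

```php
<?php
// 1. Byte-level check: does the string contain any 0x80-0xFF byte?
function has_8bit_sketch($s) {
    // Without the /u modifier PCRE matches single bytes.
    return preg_match('/[\x80-\xFF]/', $s) === 1;
}

// 2. Source-charset values: partial iso-8859-2 table mapping one byte
//    to an HTML numeric entity.
function decode_iso8859_2_sketch($s) {
    $map = array(
        "\xA1" => '&#260;', // capital A with ogonek
        "\xB1" => '&#261;', // small a with ogonek
        "\xA3" => '&#321;', // capital L with stroke
        "\xB3" => '&#322;', // small l with stroke
    );
    return str_replace(array_keys($map), array_values($map), $s);
}

// 3. UTF-8 decoding without a table: two-byte sequences only.
function decode_utf8_2byte_sketch($s) {
    return preg_replace_callback(
        '/([\xC2-\xDF])([\x80-\xBF])/',
        function ($m) {
            // 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy
            $cp = ((ord($m[1]) & 0x1F) << 6) | (ord($m[2]) & 0x3F);
            return '&#' . $cp . ';';
        },
        $s
    );
}

echo decode_iso8859_2_sketch("\xB1"), "\n";
echo decode_utf8_2byte_sketch("\xC4\x85"), "\n";
```

All three depend on "\xC4\x85" and similar escapes staying as the exact
bytes written in the script, which is the behaviour that changes when
unicode.semantics is on.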


-- 
Tomas

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
