> Disclaimer: I don't know much about the way Unicode is implemented in
> PHP; I have only used it a bit, but I believe I can clear some things
> up here.
>
>> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
>> utf-8. ą
>>
>> <?php
>> var_dump("ą" == "\xC4\x85");
>> echo "ą\n";
>> echo "\xC4\x85";
>> ?>
>>
>> If the script is written in UTF-8, I expect bool(true) on the var_dump() line.
>
> You expect wrong things. "\xC4\x85" is a unicode string containing two
> codepoints, those at 0xC4 and 0x85 (LATIN CAPITAL LETTER A WITH
> DIAERESIS and NEXT LINE (NEL)), while "ą" is a unicode string
> containing one code point (0x0105, LATIN SMALL LETTER A WITH OGONEK)
> (see
> http://www.unicode.org/charts/PDF/U0080.pdf and
> http://www.unicode.org/charts/PDF/U0100.pdf). Different strings, so
> comparison should return false. If you want to type bytes, use the
> "b" prefix: b"\xC4\x85", and compare that with the binary version of
> your string literal. var_dump(b"ą" == b"\xC4\x85"); should give you
> bool(true) if your encoding is utf-8.

LATIN CAPITAL LETTER A WITH DIAERESIS is U+00C4, not the byte 0xC4.

I wrote two 8-bit values, not two 16-bit ones. The interpreter tries to
outsmart me and assumes that I want U+00C4 when I write \xC4.

http://www.php.net/language.types.string
---
\x[0-9A-Fa-f]{1,2} - the sequence of characters matching the regular
expression is a character in hexadecimal notation
---
One or two hexadecimal digits after \x. This escape is used to write 8-bit
values. You can't write a 16-bit Unicode character with one escape.
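
To show the byte-level behaviour I am relying on, here is a minimal sketch.
It assumes plain byte-string semantics (PHP 5, or unicode.semantics=off),
and the variable names are only illustrative. The character is built from
its numeric entity with html_entity_decode() so that the result does not
depend on the encoding of the source file.

```php
<?php
// Sketch only: assumes byte-string semantics (PHP 5, or unicode.semantics=off).

// Two one-byte escapes: together they form the UTF-8 encoding of U+0105.
$needle = "\xC4\x85";

// The same character, decoded from its numeric entity into UTF-8 bytes.
$haystack = html_entity_decode('&#261;', ENT_QUOTES, 'UTF-8');

var_dump(strlen($needle));        // int(2)
var_dump(bin2hex($haystack));     // string(4) "c485"
var_dump($haystack === $needle);  // bool(true)

// Byte-wise replacement, as the mapping tables do it.
$replaced = str_replace($needle, '&#261;', $haystack);
echo $replaced, "\n";             // &#261;
```

With byte strings the comparison and the replacement both work on the two
bytes C4 85, which is what the decoding tables expect.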

And again you are suggesting an unportable solution. PHP versions that do
not know the b prefix fail with:
Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING in
test5.php on line 2

I don't want to maintain a separate version of the script for PHP 6 with
unicode.semantics=on.

>> It is bool(false) when unicode.semantics is turned on. Internal
>> SquirrelMail character set decoding functions write mapping tables in
>> hexadecimals or octals. In some cases they evaluate only the byte value
>> and not the whole symbol. Multibyte character set decoding can use
>> recode, iconv and mbstring, but most of the single-byte decoding is
>> written in plain string functions and stores hex to HTML mapping tables
>> in associative arrays.
>>
>> <?php
>> // example uses utf-8. similar code is used in iso-8859-2 -
>> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
>> // and is written in pcre.
>> $s1 = "ą";
>> $s2 = "\xC4\x85";
>> echo str_replace($s2,'&#261;',$s1);
>> ?>
>>
>> Expected result: &#261;
>> Got: ą
>
> Same thing. If you want binary replacements, use binary strings, not
> unicode strings.

mbstring.func_overload and unicode.semantics decisions must be made by
script writers and not by end users. That's why I asked for PHP_INI_ALL
level controls.

I'll wait for better documentation on the unicode.*_encoding options and
then see what I can do with them.

-- 
Tomas

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
