Edit report at https://bugs.php.net/bug.php?id=63732&edit=1

 ID:                 63732
 User updated by:    jmichae3 at yahoo dot com
 Reported by:        jmichae3 at yahoo dot com
 Summary:            unicode strings not handled correctly
 Status:             Not a bug
 Type:               Bug
 Package:            Scripting Engine problem
 Operating System:   linux
 PHP Version:        5.3.19
 Block user comment: N
 Private report:     N

 New Comment:

if you were to take the time to do the research, there is no function in PHP except ord() for converting a character [from a string] to a number. maybe strings need to be handled differently internally in php to handle UNICODE. or maybe ord() simply needs to be rewritten so it works no matter what character encoding is thrown at it. it would be difficult, but extremely useful, since it is the only function.

I took the time to look through the mb functions. there was nothing to help me: there wasn't a compare function, or any other way to compare. I consider a function like that to be crucial if relops are not safe or capable of doing it. if that is the case, please make one, and an mb function for returning the ordinal value of an mb char. the functionality is just not there.

thanks. much appreciated.

unicode/mb-related bug database stuff:
https://bugs.php.net/bug.php?id=49439
https://bugs.php.net/bug.php?id=63732
just search the database for anything with mb_encode or unicode. there are a number of bugs related to this problem.


Previous Comments:
------------------------------------------------------------------------
[2012-12-11 22:22:24] ras...@php.net

This is a bug reporting system. You reported a bug on a function that is behaving as intended and as documented. This is not a support forum. There are plenty of ways to do what you need. Start by reading about the mbstring functions.

------------------------------------------------------------------------
[2012-12-11 17:22:40] jmichae3 at yahoo dot com

it may be documented behavior, but it still doesn't provide a solution to the problem.

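For readers following the exchange above, here is a minimal sketch of one way the mbstring functions can return the ordinal value of a multibyte character. It assumes the mbstring extension is loaded and that both the string and the source file are UTF-8. The helper name utf8_code_point() is hypothetical and is not part of PHP; later PHP releases (7.2 and up) ship mb_ord() for the same purpose.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns the Unicode code point
// of the first character of a UTF-8 string, or false for an empty string.
function utf8_code_point($s)
{
    // mb_substr() counts characters rather than bytes when given an encoding.
    $char = mb_substr($s, 0, 1, 'UTF-8');
    if ($char === '' || $char === false) {
        return false;
    }
    // UCS-4BE stores every character as one big-endian 32-bit integer,
    // so unpacking a single "N" yields the code point.
    $ucs4 = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8');
    $parts = unpack('N', $ucs4);
    return $parts[1];
}

$s = "п"; // U+043F, stored as the two bytes 208 and 191 in UTF-8

echo strlen($s), "\n";             // 2 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n"; // 1 (characters)
echo utf8_code_point($s), "\n";    // 1087
var_dump(utf8_code_point($s) > ord('~')); // bool(true)
```

Comparing code points this way gives one result per character, where the byte-wise loop in the report gives one result per byte.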
------------------------------------------------------------------------
[2012-12-10 02:24:33] ahar...@php.net

PHP strings are effectively byte arrays, and ord() only looks at the first byte. This is documented behaviour.

------------------------------------------------------------------------
[2012-12-09 07:38:05] jmichae3 at yahoo dot com

Description:
------------
I am getting russian characters in my email forms. I want to compare the characters to see if they are > '~', which is the last visible character in the ascii character set. this comparison does not work. in UNICODE, these characters have code points around 1024, and ~ is 126 according to ord(). ord() thinks EVERY character is ascii. this is far from true. there are mb characters from utf8. these are random russian characters from charmap:
ÐÏÐγÏÐÐÐЫЫÐÐÐÑмдп
in fact, I don't have any working way to detect whether a character is KOI8-R or ASCII, or cyrillic, or whether the character ordinal number is actually beyond 127 or not, because according to ord(), it's all within 0-255.

Test script:
---------------
<?php
$s="п"; //russian character
echo substr_compare($s,"~",0,1);
echo "\n";
$i=0;
for ($i=0; $i < strlen($s); $i++) {
    if (substr_compare($s[$i],"~",0,1) > 0) {
        echo "OK";
    } else {
        echo "fail";
    }
    if (ord($s[$i]) > 126) {
        echo "OK";
    } else {
        echo "fail";
    }
    if ($s[$i] > '~') {
        echo "OK";
    } else {
        echo "fail";
    }
    echo ord($s[$i]);
}
echo "\n";
$i=0;
/*
strangely enough, I get 2 outputs with only 1 character.

Sat 12/08/2012 23:12:46.76||E:\www\jimm|>php t.php
1
OKOKOK208OKOKOK191

Sat 12/08/2012 23:14:27.34||E:\www\jimm|>
*/
?>

Expected result:
----------------
whole characters as a single unit. 1 result.

Actual result:
--------------
got 2 results from 1 UNICODE russian character in a string. should only get 1. this file was encoded with utf8 without bom. php is splitting the utf8 characters into a byte stream when it gets to strlen(), or it just treats unicode and utf8 characters like ascii.
this does not work well when trying to use mb_detect_encoding(): breaking characters up like that breaks the ability to detect encodings, and nearly everything else with strings, actually. this also breaks the ability to detect foreign spam.


------------------------------------------------------------------------
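
The check the report is after (is anything in the string beyond 7-bit ASCII, and is it Cyrillic?) can be sketched without ord(). This is only an illustration under stated assumptions: the input is expected to be UTF-8, PCRE is built with Unicode property support, and the function names has_non_ascii() and has_cyrillic() are hypothetical, not PHP built-ins.

```php
<?php
// Hypothetical helpers for illustration; neither is a PHP built-in.

// True if the string contains any byte outside the 7-bit ASCII range.
// This works on raw bytes, so it does not depend on the encoding.
function has_non_ascii($s)
{
    return preg_match('/[^\x00-\x7F]/', $s) === 1;
}

// True if the string is valid UTF-8 and contains a Cyrillic character.
// The /u modifier makes PCRE treat pattern and subject as UTF-8;
// on invalid UTF-8 preg_match() returns false, so this yields false.
function has_cyrillic($s)
{
    return preg_match('/\p{Cyrillic}/u', $s) === 1;
}

var_dump(has_non_ascii("plain ascii ~")); // bool(false)
var_dump(has_non_ascii("п"));             // bool(true)
var_dump(has_cyrillic("п"));              // bool(true)
var_dump(has_cyrillic("abc"));            // bool(false)
```

Because has_cyrillic() returns false for byte sequences that are not valid UTF-8, text arriving in KOI8-R would need to be converted to UTF-8 first; that step is left out of this sketch.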