Edit report at http://bugs.php.net/bug.php?id=52923&edit=1

 ID:                 52923
 Updated by:         paj...@php.net
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
 Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

It is not a bogus request. The idea would also be to get the decoded
(to UTF-8) URL elements as the result. It is also a good complement to
IDN support.

Previous Comments:
------------------------------------------------------------------------
[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid
according to RFC 3986. He should first percent-encode all the
characters that are neither reserved nor unreserved (perhaps after
doing some Unicode normalization) and only then parse the URL.

------------------------------------------------------------------------
[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be
easy to implement using either native OS APIs (when available) or
external libraries (there are a couple of good ones out there).

------------------------------------------------------------------------
[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request. parse_url has never been
multibyte-aware.

------------------------------------------------------------------------
[2010-09-25 11:09:39] masteram at gmail dot com

Description:
------------
I have tested this with PHP 5.2.9 and 5.3.3. parse_url does not process
some UTF-8 strings correctly. In the given example, strings that
contain the characters 'ם' or 'א' come back corrupted, whereas the
string '××ש××' (which does not contain those characters) is processed
correctly.

The affected characters consist of the following UTF-8 bytes:
ם - d7|9d
א - d7|90

Each of them is converted to a character consisting of the bytes d7|5f.

In addition to ruining the URL, this character is not safe with
preg_replace.
Therefore, if we merge the result of parse_url back into a string and
then attempt to replace, say, spaces with underscores using
preg_replace, we get an empty string.

I believe that this is similar to bug #26391.

Test script:
---------------
<?php
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$url = parse_url($url);
// $url['path'] is now corrupt
$url = preg_replace('/\s+/u', '_', $url['path']);
// $url is now NULL: the /u modifier rejects the invalid UTF-8 in the path

Expected result:
----------------
The correct portion of the URL.

Actual result:
--------------
Corrupt string (or blank after using preg_replace).

------------------------------------------------------------------------
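
The approach suggested in the comments above (percent-encode first, then parse, then hand back decoded UTF-8 elements) can be sketched in userland as follows. This is only a minimal sketch under stated assumptions, not the implementation proposed for PHP: the function name parse_url_utf8_sketch is hypothetical, the input is assumed to be valid UTF-8, and no Unicode normalization or IDN handling is attempted. The sample path is built from the two byte sequences listed in the report (d7|90 and d7|9d).

```php
<?php
// Hypothetical helper, not part of PHP: percent-encode every non-ASCII
// byte, parse the now ASCII-only URL, then decode each string component.
function parse_url_utf8_sketch($url)
{
    // Bytes 0x80-0xFF only occur inside multibyte UTF-8 sequences.
    // The pattern deliberately has no /u modifier, so it matches single bytes.
    $encoded = preg_replace_callback(
        '/[\x80-\xFF]/',
        function (array $m) {
            return rawurlencode($m[0]);
        },
        $url
    );

    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    // Decode string components; the port (an integer) is left alone.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

// Path segment made of the two characters the report names (d7|90, d7|9d).
$parts = parse_url_utf8_sketch("http://www.mysite.org/he/\xD7\x90\xD7\x9D/ByYear.html");
$path  = preg_replace('/\s+/u', '_', $parts['path']);
echo $path, "\n";
```

Because parse_url only ever sees ASCII here, the result should not depend on the platform's locale. One caveat of this sketch: it also decodes percent-escapes that were already present in the input (for example %2F in a path), which may not be wanted.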