Edit report at http://bugs.php.net/bug.php?id=52923&edit=1
 ID:                 52923
 User updated by:    masteram at gmail dot com
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
 Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

New Comment:

I tend to agree with Pajoye. Although RFC 3986 calls for the use of percent-encoding in URLs, I believe it also mentions the IDN format (and, the way things look today, there is a host of websites that use UTF-8 encoding, which benefits the readability of internationalized URLs).

I admit I am not an expert in URL encoding, but it seems to me that corrupting a string, even one that does not meet the current standards, is a bad habit. In addition, UTF-8 encoded URLs seem to be quite common in reality; take the international versions of Wikipedia as an example. If I'm wrong about that, I would be more than happy to know it.

I am not sure that the encode-analyze-merge-decode procedure is really the best choice. Perhaps the streamlined alternative should be considered. It sure wouldn't hurt. I, for one, am currently using 'ASCII-only' URLs.
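[Editor's illustration: a minimal userland sketch of the encode-parse-decode procedure discussed above, assuming PHP 5.3+ for the closure syntax. The function name parse_url_utf8 merely reuses the name suggested by pajoye below; it is a hypothetical workaround sketch, not an existing PHP API.]

<?php
// Hypothetical helper: percent-encode non-ASCII bytes so that parse_url()
// only ever sees ASCII, then decode each component back to UTF-8.
function parse_url_utf8($url)
{
    // Encode every byte outside printable ASCII (existing '%' escapes are left alone here).
    $ascii = preg_replace_callback(
        '/[^\x20-\x7e]/',
        function ($m) { return rawurlencode($m[0]); },
        $url
    );

    $parts = parse_url($ascii);
    if ($parts === false) {
        return false;
    }

    // Decode the string components back to UTF-8.
    // Caveat: percent-escapes that were already present in the original URL
    // are decoded as well, which may or may not be desirable.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

// Usage (with the URL from the test script below):
// $parts = parse_url_utf8('http://www.mysite.org/he/פרויקטים/ByYear.html');
// $path  = preg_replace('/\s+/u', '_', $parts['path']);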
Previous Comments:
------------------------------------------------------------------------
[2010-09-25 14:34:34] paj...@php.net

It is not a bogus request. The idea would also be to get the decoded (to UTF-8) URL elements as the result. It is also a good complement to IDN support.

------------------------------------------------------------------------
[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid according to RFC 3986. He should first percent-encode all the characters that are not unreserved (perhaps after doing some Unicode normalization) and only then parse the URL.

------------------------------------------------------------------------
[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be easy to implement using either native OS APIs (when available) or external libraries (there are a couple of good ones out there).

------------------------------------------------------------------------
[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request. parse_url has never been multibyte-aware.

------------------------------------------------------------------------
[2010-09-25 11:09:39] masteram at gmail dot com

Description:
------------
I have tested this with PHP 5.2.9 and 5.3.3.

Some UTF-8 strings are not processed correctly by parse_url. In the given example, the result of parsing strings that contain the characters 'ם' or 'א' is corrupt, whereas the string '××ש××' (which does not contain those characters) is processed correctly.

The affected characters (in UTF-8) consist of the following bytes:
ם - d7|9d
א - d7|90
Both are converted to a character consisting of the bytes d7|5f. In addition to ruining the URL, this character is not safe with preg_replace. Therefore, if we merge the result of parse_url back into a string and then attempt to replace, say, spaces with underscores using preg_replace, we get an empty string. I believe this is similar to bug #26391.

Test script:
---------------
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$url = parse_url($url);
//$url['path'] is now corrupt
$url = preg_replace('/\s+/u','_',$url['path']);
//$url is now undefined

Expected result:
----------------
The correct portion of the URL.

Actual result:
--------------
Corrupt string (or blank after using preg_replace).

------------------------------------------------------------------------
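[Editor's note: a small reproduction sketch, assuming the same example URL as the test script. It dumps the raw bytes of the path before and after parse_url() so the d7|9d to d7|5f substitution described in the report can be checked directly; no particular output is guaranteed on unaffected builds.]

<?php
// Compare the path bytes before and after parse_url() to spot corruption.
$url  = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$path = '/he/פרויקטים/ByYear.html';

$parts = parse_url($url);

echo "expected: ", bin2hex($path), "\n";
echo "actual:   ", bin2hex($parts['path']), "\n";

// Per the report, an affected build turns d79d into d75f, so the path is no
// longer valid UTF-8 and the /u pattern fails (preg_replace() returns NULL
// on a PCRE error; the reporter sees an empty/undefined result).
var_dump(preg_replace('/\s+/u', '_', $parts['path']));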