Edit report at http://bugs.php.net/bug.php?id=52923&edit=1
 ID:                 52923
 Updated by:         ras...@php.net
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
-Type:               Bug
+Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

Reclassifying as a feature request. parse_url has never been
multibyte-aware.

Previous Comments:
------------------------------------------------------------------------
[2010-09-25 11:09:39] masteram at gmail dot com

Description:
------------
I have tested this with PHP 5.2.9 and 5.3.3.

Some UTF-8 strings are not processed correctly by parse_url. In the
given example, strings that contain the characters 'ם' or 'א' come back
corrupt, whereas the string '××ש××' (which does not contain those
characters) is processed correctly.

The affected characters (in UTF-8) consist of the following bytes:

ם - d7|9d
א - d7|90

Each is converted to a character consisting of the bytes d7|5f.

In addition to ruining the URL, this character is not safe with
preg_replace. Therefore, if we merge the result of parse_url back into a
string and then attempt to replace, say, spaces with underscores using
preg_replace, we get an empty string.

I believe that this is similar to bug #26391.

Test script:
---------------
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$url = parse_url($url); //$url['path'] is now corrupt
$url = preg_replace('/\s+/u','_',$url['path']); //$url is now undefined

Expected result:
----------------
The correct portion of the url.

Actual result:
--------------
Corrupt string (or blank after using preg_replace).

------------------------------------------------------------------------
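
Since parse_url is not multibyte-aware, one possible workaround is to hand it ASCII only. The sketch below is an illustration, not part of PHP: the helper name parse_url_utf8 is hypothetical. It percent-encodes every byte in the range 0x80-0xFF, calls parse_url, and then decodes each string component. The first lines also reproduce the preg_replace failure using the d7|5f byte pair quoted in the report. The sample URL is written with byte escapes for the two affected characters (d7|9d and d7|90).

```php
<?php
// The corrupt byte pair from the report: 0xD7 is a UTF-8 lead byte,
// but 0x5F ('_') is not a valid continuation byte.
$corrupt = "/he/\xD7\x5F/ByYear.html";

// With the /u modifier, PCRE rejects invalid UTF-8 subjects.
$result = preg_replace('/\s+/u', '_', $corrupt);
var_dump($result);                                    // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)

// Hypothetical helper (not a PHP built-in): make the URL ASCII-only
// before parsing, then restore the original bytes afterwards.
function parse_url_utf8($url)
{
    // Without /u the character class matches single bytes 0x80-0xFF.
    $encoded = preg_replace_callback(
        '/[\x80-\xFF]/',
        function ($m) {
            return rawurlencode($m[0]);
        },
        $url
    );

    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    // Decode every string component; the port (an integer) is left alone.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

// Sample URL containing the two affected characters (d7|9d and d7|90).
$url   = "http://www.mysite.org/he/\xD7\x9D\xD7\x90/ByYear.html";
$parts = parse_url_utf8($url);
$path  = preg_replace('/\s+/u', '_', $parts['path']);
```

One caveat with this approach: rawurldecode also decodes percent-escapes that were already present in the input, so a URL that arrives with, say, %2F in its path would come back with a literal slash. If that matters, the decoding step would need to be restricted to the escapes the helper itself produced.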