Edit report at http://bugs.php.net/bug.php?id=52923&edit=1

 ID:                 52923
 Updated by:         paj...@php.net
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
 Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

It is not a bogus request. The idea would also be to get the decoded
(UTF-8) URL elements as the result. It would also be a good complement
to IDN support.


Previous Comments:
------------------------------------------------------------------------
[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid
according to RFC 3986. He should first percent-encode all the characters
that are not allowed in a URI, such as non-ASCII bytes (perhaps after
doing some Unicode normalization), and only then parse the URL.
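
The encode-then-parse approach described above could look roughly like
the following. This is a minimal sketch, not part of the original
comment: encode_non_ascii() is a hypothetical helper name, and the
sketch assumes the input is UTF-8 and contains no percent-escapes of its
own that must survive the final decoding step.

```php
<?php
// Hypothetical helper (not a PHP built-in): percent-encode every byte
// >= 0x80 and leave all ASCII characters, including delimiters, untouched.
function encode_non_ascii($url)
{
    return preg_replace_callback(
        '/[\x80-\xFF]/', // no /u modifier: match single bytes
        function ($m) {
            return rawurlencode($m[0]);
        },
        $url
    );
}

$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';

// parse_url() now only ever sees ASCII.
$parts = parse_url(encode_non_ascii($url));

// Decode the component back to UTF-8 after parsing.
$path = rawurldecode($parts['path']);
```

Note that rawurldecode() also decodes escapes that were already present
in the input, so a caller who needs those preserved would have to handle
them separately.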

------------------------------------------------------------------------
[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be
easy to implement using either native OS APIs (when available) or
external libraries (there are a couple of good ones out there).

------------------------------------------------------------------------
[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request.  parse_url has never been
multibyte-aware.

------------------------------------------------------------------------
[2010-09-25 11:09:39] masteram at gmail dot com

Description:
------------
I have tested this with PHP 5.2.9 and 5.3.3.

Some UTF-8 strings are not being processed correctly by parse_url.

In the given example, strings that contain the chars 'ם' or 'א' come
back corrupt, whereas the string 'מישהו' (which contains neither of
those chars) is processed correctly.

The affected characters (in UTF-8) consist of the following bytes:

ם - d7|9d

א - d7|90



Those are converted to a char which contains the following bytes:
d7|5f.



In addition to ruining the URL, this char is not safe with
preg_replace.

Therefore, if we merge the result of parse_url back into a string and
then attempt to replace, say, spaces with underscores using
preg_replace, we get an empty string.
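
The failure described above can be reproduced without parse_url by
building the reported byte sequences by hand. This is a minimal sketch:
the path string is an illustrative assumption, and only the bytes d7|9d
and d7|5f are taken from the report.

```php
<?php
// Bytes from the report: 'ם' is d7 9d in UTF-8; the corrupted form is d7 5f.
$good = "\xD7\x9D";
$bad  = "\xD7\x5F";

echo bin2hex($good), "\n"; // d79d

// An empty pattern with the /u modifier only succeeds on valid UTF-8.
$goodIsUtf8 = preg_match('//u', $good) === 1;
$badIsUtf8  = preg_match('//u', $bad) === 1;

// Illustrative path containing the corrupted sequence and a space.
$result = preg_replace('/\s+/u', '_', '/he/' . $bad . '/By Year.html');
$error  = preg_last_error();
```

Here $bad is not valid UTF-8 because 5f is not a continuation byte, so
preg_replace fails and returns NULL rather than a string, and
preg_last_error() reports PREG_BAD_UTF8_ERROR.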



I believe that this is similar to bug #26391.

Test script:
---------------
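<?php
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';

$parts = parse_url($url); // $parts['path'] is now corrupt

// With the /u modifier, invalid UTF-8 in the subject makes
// preg_replace fail and return NULL instead of a string.
$path = preg_replace('/\s+/u', '_', $parts['path']); // $path is now NULL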

Expected result:
----------------
The correct portion of the url.

Actual result:
--------------
Corrupt string (or blank after using preg_replace).


------------------------------------------------------------------------


