Edit report at http://bugs.php.net/bug.php?id=52923&edit=1
 ID:                 52923
 User updated by:    masteram at gmail dot com
 Reported by:        masteram at gmail dot com
 Summary:            parse_url corrupts some UTF-8 strings
 Status:             Open
 Type:               Feature/Change Request
 Package:            *URL Functions
 Operating System:   MS Windows XP
 PHP Version:        5.3.3
 Block user comment: N

New Comment:

I tend to agree with Pajoye. Although RFC 3986 calls for the use of percent-encoding in URLs, I believe it also mentions the IDN format (and, the way things look today, there is a host of websites that use UTF-8 encoding, which benefits the readability of internationalized URLs).

I admit I am not an expert in URL encoding, but it seems to me that corrupting a string, even one that does not meet the current standards, is a bad habit. In addition, UTF-8 encoded URLs seem to be quite common in reality; take the international versions of Wikipedia as an example. If I'm wrong about that, I would be more than happy to know it.

I am not sure that the encode-analyze-merge-decode procedure is really the best choice. Perhaps the streamlined alternative should be considered. It sure wouldn't hurt. I, for one, am currently using 'ASCII-only' URLs.
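[Editor's illustration: a minimal userland sketch of the encode-parse-decode procedure discussed above, assuming PHP 5.3+ for the closure syntax. The function name parse_url_utf8 merely reuses the name suggested by pajoye below; it is a hypothetical workaround sketch, not an existing PHP API.]

<?php
// Hypothetical helper: percent-encode non-ASCII bytes so that parse_url()
// only ever sees ASCII, then decode each component back to UTF-8.
function parse_url_utf8($url)
{
    // Encode every byte outside printable ASCII (existing '%' escapes are left alone here).
    $ascii = preg_replace_callback(
        '/[^\x20-\x7e]/',
        function ($m) { return rawurlencode($m[0]); },
        $url
    );

    $parts = parse_url($ascii);
    if ($parts === false) {
        return false;
    }

    // Decode the string components back to UTF-8.
    // Caveat: percent-escapes that were already present in the original URL
    // are decoded as well, which may or may not be desirable.
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }
    return $parts;
}

// Usage (with the URL from the test script below):
// $parts = parse_url_utf8('http://www.mysite.org/he/פרויקטים/ByYear.html');
// $path  = preg_replace('/\s+/u', '_', $parts['path']);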
Previous Comments:
------------------------------------------------------------------------
[2010-09-25 14:34:34] paj...@php.net

It is not a bogus request. The idea would also be to get the decoded (to UTF-8) URL elements as the result. It is also a good complement to IDN support.

------------------------------------------------------------------------
[2010-09-25 14:19:40] cataphr...@php.net

I'd say this request/bug is bogus because such a URL is not valid according to RFC 3986. He should first percent-encode all the characters that are not unreserved (perhaps after doing some Unicode normalization) and only then parse the URL.

------------------------------------------------------------------------
[2010-09-25 12:15:15] paj...@php.net

What about a parse_url_utf8, like what we have for IDN? It could be easy to implement using either native OS APIs (when available) or external libraries (there are a couple of good ones out there).

------------------------------------------------------------------------
[2010-09-25 11:42:29] ras...@php.net

Reclassifying as a feature request. parse_url has never been multibyte-aware.

------------------------------------------------------------------------
[2010-09-25 11:09:39] masteram at gmail dot com

Description:
------------
I have tested this with PHP 5.2.9 and 5.3.3.

Some UTF-8 strings are not processed correctly by parse_url. In the given example, the result of parsing strings that contain the characters 'ם' or 'א' is corrupt, whereas the string '××ש××' (which does not contain those characters) is processed correctly.

The affected characters (in UTF-8) consist of the following bytes:
ם - d7|9d
א - d7|90
Both are converted to a character consisting of the bytes d7|5f. In addition to ruining the URL, this character is not safe with preg_replace. Therefore, if we merge the result of parse_url back into a string and then attempt to replace, say, spaces with underscores using preg_replace, we get an empty string. I believe this is similar to bug #26391.

Test script:
---------------
$url = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$url = parse_url($url);
//$url['path'] is now corrupt
$url = preg_replace('/\s+/u','_',$url['path']);
//$url is now undefined

Expected result:
----------------
The correct portion of the URL.

Actual result:
--------------
Corrupt string (or blank after using preg_replace).

------------------------------------------------------------------------
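[Editor's note: a small reproduction sketch, assuming the same example URL as the test script. It dumps the raw bytes of the path before and after parse_url() so the d7|9d to d7|5f substitution described in the report can be checked directly; no particular output is guaranteed on unaffected builds.]

<?php
// Compare the path bytes before and after parse_url() to spot corruption.
$url  = 'http://www.mysite.org/he/פרויקטים/ByYear.html';
$path = '/he/פרויקטים/ByYear.html';

$parts = parse_url($url);

echo "expected: ", bin2hex($path), "\n";
echo "actual:   ", bin2hex($parts['path']), "\n";

// Per the report, an affected build turns d79d into d75f, so the path is no
// longer valid UTF-8 and the /u pattern fails (preg_replace() returns NULL
// on a PCRE error; the reporter sees an empty/undefined result).
var_dump(preg_replace('/\s+/u', '_', $parts['path']));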