ID: 52923
Comment by: bugsphpnet at lumental dot com
Reported by: masteram at gmail dot com
Summary: parse_url corrupts some UTF-8 strings
Status: Open
Type: Feature/Change Request
Package: URL related
Operating System: MS Windows XP
PHP Version: 5.3.3
Block user comment: N
Private report: N
New Comment:
On our Debian 4.3.2-1.1 server, changing the locale from LANG=en_US to
LANG=en_US.UTF-8 seems to have fixed this problem.
In my opinion, parse_url() should treat all extended characters (octets
0x80-0xFF) as opaque and copy them as-is, without modification. The function
would then work correctly for both UTF-8 and ISO-8859-1 strings. The behaviour
of parse_url() should not depend on the LANG setting. In my opinion, this
function is buggy.
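A minimal sketch of that proposal (the example URL is hypothetical, and the
expected output assumes the opaque-octet behaviour argued for above):

<?php
// If parse_url() copied octets 0x80-0xFF through untouched, this would
// print the path byte-for-byte regardless of the LANG setting:
$url = 'http://example.com/he/פרויקטים/ByYear.html';
var_dump(parse_url($url, PHP_URL_PATH));
// Expected: string(32) "/he/פרויקטים/ByYear.html"
// On an affected build/locale the multibyte segment is corrupted instead.
?>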
Previous Comments:
------------------------------------------------------------------------
[2010-12-08 22:15:23] dextercowley at gmail dot com
This issue seems to be platform dependent. For example, on Windows Vista with
PHP 5.3.1, parse_url('http://mydomain.com/path/é') returns $array['path'] =
"/path/". On a Mac, however, it works correctly and returns "/path/é".
We can work around it by URL-encoding the URL, decoding the various legal URL
characters ("/", ":", "&", and so on) before running parse_url, and then
decoding the resulting path. However, a parse_url_utf8 function would be very
convenient and probably faster. Thanks.
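A sketch of such a wrapper, following the encode-parse-decode workaround
described above (the name parse_url_utf8 is hypothetical, not a PHP built-in,
and the delimiter list is an assumption that may need tuning):

<?php
// Hypothetical parse_url_utf8(): percent-encode every byte except the
// URL delimiters, so parse_url() only ever sees ASCII, then decode the
// resulting components back to their original bytes.
function parse_url_utf8($url)
{
    $encoded = preg_replace_callback(
        '%[^:/@?&=#]+%',   // leave the delimiters parse_url splits on
        function ($m) { return rawurlencode($m[0]); },
        $url
    );
    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }
    // Decode each component back; note this also casts an integer
    // 'port' entry to string, which is acceptable for a sketch.
    return array_map('rawurldecode', $parts);
}

var_dump(parse_url_utf8('http://mydomain.com/path/é'));
// ['path'] should be "/path/é" on every platform.
?>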
------------------------------------------------------------------------
[2010-09-26 09:46:39] [email protected]
The problem is that nothing guarantees a percent-encoded URL should be
interpreted as containing UTF-8 data or that an (invalid) URL containing
non-encoded unreserved characters should be converted to UTF-8 before being
percent-encoded.
In fact, while most browsers will use UTF-8 to build URLs entered in the
address bar, for anchors inside HTML pages they will prefer the encoding of
the page instead, provided it is also an ASCII superset.
That said, the corruption you describe seems uncalled for. In fact, I am unable
to reproduce it. This is the value of $url I get in the end:
string(32) "/he/פרויקטים/ByYear.html"
------------------------------------------------------------------------
[2010-09-25 16:22:19] masteram at gmail dot com
I tend to agree with Pajoye.
Although RFC 3986 calls for the use of percent-encoding in URLs, I believe
that it also mentions the IDN format (and, the way things look today, there is
a host of websites that use UTF-8 encoding, which benefits the readability of
internationalized URLs).
I admit not being an expert in URL encoding, but it seems to me that corrupting
a string, even if it does not meet the current standards, is a bad habit.
In addition, UTF-8 encoded URLs seem to be quite common in reality. Take the
international versions of Wikipedia as an example.
If I'm wrong about that, I would be more than happy to know it.
I am not sure that the encode-analyze-merge-decode procedure is really the best
choice. Perhaps the streamlined alternative should be considered. It sure
wouldn't hurt.
I, for one, am currently using 'ASCII-only' URLs.
------------------------------------------------------------------------
[2010-09-25 14:34:34] [email protected]
It is not a bogus request. The idea would also be to get the decoded (to
UTF-8) URL elements as the result. It is also a good complement to IDN
support.
------------------------------------------------------------------------
[2010-09-25 14:19:40] [email protected]
I'd say this request/bug is bogus because such a URL is not valid according to
RFC 3986. He should first percent-encode all the characters that are not
allowed in a URL (perhaps after doing some Unicode normalization) and only
then parse the URL.
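In code, that suggestion amounts to something like the following (hypothetical
example; rawurlencode is applied to the non-ASCII path segment before
parsing):

<?php
// Percent-encode the non-ASCII segment first, producing a URL that is
// valid per RFC 3986, then parse it; nothing gets corrupted.
$segment = 'פרויקטים';   // hypothetical Hebrew path segment
$url = 'http://example.com/he/' . rawurlencode($segment) . '/ByYear.html';
echo parse_url($url, PHP_URL_PATH);
// /he/%D7%A4%D7%A8%D7%95%D7%99%D7%A7%D7%98%D7%99%D7%9D/ByYear.html
?>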
------------------------------------------------------------------------
The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
https://bugs.php.net/bug.php?id=52923