Hi all,

The discussion around WHATWG-URL on this list, as well as my work coordinating 
Uri-Interop <https://github.com/uri-interop/interface>, has led me to think PHP 
needs a multibyte equivalent of rawurlencode().

Broadly speaking, as far as I can tell:

- For an RFC 3986 URI, delimiters need to be percent-encoded, as well as 
non-ASCII characters.
- For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS 
characters do not.

(There are other details but I think you get the idea.)

The rawurlencode() function does fine for URIs, but not for IRIs. Using 
rawurlencode() for an IRI will encode multibyte characters when it should leave 
them alone. For example:

```php
$val = 'fü bar';

$uriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true

$iriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($iriPath === '/heads/fü bar/tails/'); // false
```

(This might apply to WHATWG-URL component construction as well.)
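
For what it's worth, one userland workaround for the IRI case today is to post-process the output of rawurlencode() and turn the percent-encoded high bytes back into raw bytes. The sketch below is illustrative only: the name iri_segment_encode() is hypothetical, and it assumes the input is valid UTF-8. Note that the space stays encoded as %20, since as far as I can tell RFC 3987 does not allow a raw space in a path segment either.

```php
// Hypothetical helper; the name is illustrative and not part of PHP.
// Assumes $segment is valid UTF-8.
function iri_segment_encode(string $segment): string
{
    // rawurlencode() emits uppercase hex, so bytes 0x80-0xFF always show up
    // as %80 through %FF. Each run of them is decoded back to raw bytes,
    // which restores the original multibyte characters.
    return preg_replace_callback(
        '/(?:%[89A-F][0-9A-F])+/',
        static fn (array $m): string => rawurldecode($m[0]),
        rawurlencode($segment)
    );
}

$iriPath = '/heads/' . iri_segment_encode('fü bar') . '/tails/';
assert($iriPath === '/heads/fü%20bar/tails/');
```

A literal "%" in the input is safe here, because rawurlencode() turns it into %25, which the pattern does not match.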

Have I missed something, either in the specs or in PHP itself?

If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland 
implementation might look something like the code below.

Thoughts?

* * *

```php
function mb_rawurlencode(string $string) : string
{
    $encoded = '';

    foreach (mb_str_split($string) as $char) {
        $encoded .= match ($char) {
            chr(0) => "%00",
            chr(1) => "%01",
            chr(2) => "%02",
            chr(3) => "%03",
            chr(4) => "%04",
            chr(5) => "%05",
            chr(6) => "%06",
            chr(7) => "%07",
            chr(8) => "%08",
            chr(9) => "%09",
            chr(10) => "%0A",
            chr(11) => "%0B",
            chr(12) => "%0C",
            chr(13) => "%0D",
            chr(14) => "%0E",
            chr(15) => "%0F",
            chr(16) => "%10",
            chr(17) => "%11",
            chr(18) => "%12",
            chr(19) => "%13",
            chr(20) => "%14",
            chr(21) => "%15",
            chr(22) => "%16",
            chr(23) => "%17",
            chr(24) => "%18",
            chr(25) => "%19",
            chr(26) => "%1A",
            chr(27) => "%1B",
            chr(28) => "%1C",
            chr(29) => "%1D",
            chr(30) => "%1E",
            chr(31) => "%1F",
            chr(127) => "%7F",
            "!" => '%21',
            "#" => '%23',
            "$" => '%24',
            "%" => '%25',
            "&" => '%26',
            "'" => '%27',
            "(" => '%28',
            ")" => '%29',
            "*" => '%2A',
            "+" => '%2B',
            "," => '%2C',
            "/" => '%2F',
            ":" => '%3A',
            ";" => '%3B',
            "=" => '%3D',
            "?" => '%3F',
            "[" => '%5B',
            "]" => '%5D',
            default => $char,
        };
    }

    return $encoded;
}
```
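
For comparison, a more compact variant could lean on a byte-wise regex instead of a match table. This is only a sketch under my own assumptions, not a proposed final implementation: the name iri_rawurlencode() is hypothetical, the input is assumed to be valid UTF-8, and unlike the version above it follows rawurlencode() for ASCII, so it also encodes the space and characters such as "@" and "<". It passes every byte from 0x80 up, which is looser than the ucschar ranges in RFC 3987.

```php
// Hypothetical name; not part of PHP. Assumes $string is valid UTF-8.
function iri_rawurlencode(string $string): string
{
    // Without the /u modifier the pattern works on single bytes.
    // Every ASCII byte outside the RFC 3986 "unreserved" set
    // (A-Z a-z 0-9 - . _ ~) is percent-encoded; bytes 0x80-0xFF are
    // excluded from the match, so multibyte sequences pass through.
    return preg_replace_callback(
        '/[^A-Za-z0-9\-._~\x80-\xFF]/',
        static fn (array $m): string => sprintf('%%%02X', ord($m[0])),
        $string
    );
}

assert(iri_rawurlencode('fü bar') === 'fü%20bar');
assert(iri_rawurlencode('a/b?c') === 'a%2Fb%3Fc');
```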

* * *


-- pmj
