Hi all,

The discussion around WHATWG-URL on this list, as well as my work coordinating Uri-Interop <https://github.com/uri-interop/interface>, has led me to think PHP needs a multibyte equivalent of rawurlencode().
Broadly speaking, as far as I can tell:

- For an RFC 3986 URI, delimiters need to be percent-encoded, as well as non-ASCII characters.
- For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS characters do not.

(There are other details but I think you get the idea.)

The rawurlencode() function does fine for URIs, but not for IRIs. Using rawurlencode() for an IRI will encode multibyte characters when it should leave them alone. For example:

```
$val = 'fü bar';

$uriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true

$iriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($iriPath === '/heads/fü bar/tails/'); // false
```

(This might apply to WHATWG-URL component construction as well.)

Have I missed something, either in the specs or in PHP itself? If not, how do we feel about an RFC for mb_rawurlencode()?

A naive userland implementation might look something like the code below.

Thoughts?

* * *

```php
function mb_rawurlencode(string $string) : string
{
    $encoded = '';

    // Walk the string one multibyte character at a time.
    foreach (mb_str_split($string) as $char) {
        $encoded .= match ($char) {
            // ASCII control characters
            chr(0) => "%00", chr(1) => "%01", chr(2) => "%02", chr(3) => "%03",
            chr(4) => "%04", chr(5) => "%05", chr(6) => "%06", chr(7) => "%07",
            chr(8) => "%08", chr(9) => "%09", chr(10) => "%0A", chr(11) => "%0B",
            chr(12) => "%0C", chr(13) => "%0D", chr(14) => "%0E", chr(15) => "%0F",
            chr(16) => "%10", chr(17) => "%11", chr(18) => "%12", chr(19) => "%13",
            chr(20) => "%14", chr(21) => "%15", chr(22) => "%16", chr(23) => "%17",
            chr(24) => "%18", chr(25) => "%19", chr(26) => "%1A", chr(27) => "%1B",
            chr(28) => "%1C", chr(29) => "%1D", chr(30) => "%1E", chr(31) => "%1F",
            chr(127) => "%7F",
            // Delimiters
            "!" => '%21', "#" => '%23', "$" => '%24', "%" => '%25',
            "&" => '%26', "'" => '%27', "(" => '%28', ")" => '%29',
            "*" => '%2A', "+" => '%2B', "," => '%2C', "/" => '%2F',
            ":" => '%3A', ";" => '%3B', "=" => '%3D', "?" => '%3F',
            "[" => '%5B', "]" => '%5D',
            // Everything else, including multibyte characters, is left as-is.
            default => $char,
        };
    }

    return $encoded;
}
```

* * *

-- pmj
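
P.S. For comparison, here is a more compact sketch of the same idea, offered tentatively rather than as the proposed implementation. It assumes the input is valid UTF-8 and works on bytes: every ASCII byte outside the RFC 3986 "unreserved" set is percent-encoded, and every byte at or above 0x80 passes through untouched. The name iri_rawurlencode_sketch() is hypothetical. Unlike the naive version above, it also encodes the space and the other ASCII characters that rawurlencode() encodes (such as `"`, `<`, `>` and `@`), and it makes no attempt to check code points against the RFC 3987 ucschar ranges.

```php
// Hypothetical helper name; a sketch under the assumptions stated above.
function iri_rawurlencode_sketch(string $string): string
{
    // No /u modifier, so the pattern matches single bytes. The negated class
    // keeps ASCII letters, digits, "-", ".", "_", "~", and all bytes
    // 0x80-0xFF (the bytes that make up UTF-8 multibyte sequences).
    $result = preg_replace_callback(
        '/[^A-Za-z0-9\-._~\x80-\xFF]/',
        static fn (array $m): string => rawurlencode($m[0]),
        $string
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to
    // the unmodified input in that (unlikely) case.
    return $result ?? $string;
}

echo '/heads/' . iri_rawurlencode_sketch('fü bar') . '/tails/', PHP_EOL;
// prints: /heads/fü%20bar/tails/
```

Delegating each matched byte to rawurlencode() keeps the percent-encoding itself identical to the existing function; only the choice of which bytes get encoded differs.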