---------- Forwarded message ---------
From: youkidearitai <youkideari...@gmail.com>
Date: Thu, Mar 20, 2025, 14:41
Subject: Re: [PHP-DEV] Potential RFC: mb_rawurlencode() ?
To: Paul M. Jones <pmjo...@pmjones.io>

On Wed, Mar 19, 2025 at 2:52, Paul M. Jones <pmjo...@pmjones.io> wrote:
>
> Hi all,
>
> The discussion around WHATWG-URL on this list, as well as my work
> coordinating Uri-Interop <https://github.com/uri-interop/interface>, led me
> to think PHP needs a multibyte equivalent of rawurlencode().
>
> Broadly speaking, as far as I can tell:
>
> - For an RFC 3986 URI, delimiters need to be percent-encoded, as well as
>   non-ASCII characters.
> - For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS
>   characters do not.
>
> (There are other details but I think you get the idea.)
>
> The rawurlencode() function does fine for URIs, but not for IRIs. Using
> rawurlencode() for an IRI will encode multibyte characters when it should
> leave them alone. For example:
>
> ```
> $val = 'fü bar';
>
> $uriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true
>
> $iriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($iriPath === '/heads/fü bar/tails/'); // false
> ```
>
> (This might apply to WHATWG-URL component construction as well.)
>
> Have I missed something, either in the specs or in PHP itself?
>
> If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland
> implementation might look something like the code below.
>
> Thoughts?
>
> * * *
>
> ```php
> function mb_rawurlencode(string $string) : string
> {
>     $encoded = '';
>
>     foreach (mb_str_split($string) as $char) {
>         $encoded .= match ($char) {
>             chr(0) => "%00",
>             chr(1) => "%01",
>             chr(2) => "%02",
>             chr(3) => "%03",
>             chr(4) => "%04",
>             chr(5) => "%05",
>             chr(6) => "%06",
>             chr(7) => "%07",
>             chr(8) => "%08",
>             chr(9) => "%09",
>             chr(10) => "%0A",
>             chr(11) => "%0B",
>             chr(12) => "%0C",
>             chr(13) => "%0D",
>             chr(14) => "%0E",
>             chr(15) => "%0F",
>             chr(16) => "%10",
>             chr(17) => "%11",
>             chr(18) => "%12",
>             chr(19) => "%13",
>             chr(20) => "%14",
>             chr(21) => "%15",
>             chr(22) => "%16",
>             chr(23) => "%17",
>             chr(24) => "%18",
>             chr(25) => "%19",
>             chr(26) => "%1A",
>             chr(27) => "%1B",
>             chr(28) => "%1C",
>             chr(29) => "%1D",
>             chr(30) => "%1E",
>             chr(31) => "%1F",
>             chr(127) => "%7F",
>             "!" => '%21',
>             "#" => '%23',
>             "$" => '%24',
>             "%" => '%25',
>             "&" => '%26',
>             "'" => '%27',
>             "(" => '%28',
>             ")" => '%29',
>             "*" => '%2A',
>             "+" => '%2B',
>             "," => '%2C',
>             "/" => '%2F',
>             ":" => '%3A',
>             ";" => '%3B',
>             "=" => '%3D',
>             "?" => '%3F',
>             "[" => '%5B',
>             "]" => '%5D',
>             default => $char,
>         };
>     }
>
>     return $encoded;
> }
> ```
>
> * * *
>
>
> -- pmj

Hi, Paul.

I think the signature should be as follows:

```php
function mb_rawurlencode(string $string, string $encode): string {}
```

This is because mbstring functions also handle encodings other than Unicode
(ISO-8859-1 to ISO-8859-16, Shift_JIS, EUC-*, etc.).

Other than that, I don't have any comments yet.

Oops, I forgot to send this to internals. Sorry, resending it.

Yuya

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
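
To illustrate why the extra encoding argument matters, here is a minimal userland sketch, not a proposed implementation. The function name `mb_rawurlencode_sketch` is hypothetical, and the parameter is named `$encoding` (rather than `$encode`) with a `null` default that falls back to `mb_internal_encoding()`; both choices are assumptions. For brevity it swaps the match table quoted above for `rawurlencode()` on single-byte ASCII characters, so it also encodes characters the table leaves alone, such as the space. It assumes an ASCII-compatible encoding; UTF-16 and similar encodings would need different handling.

```php
<?php
// Hypothetical sketch: not part of mbstring and not the proposed implementation.
function mb_rawurlencode_sketch(string $string, ?string $encoding = null): string
{
    // Assumption: fall back to the internal encoding when none is given.
    $encoding ??= mb_internal_encoding();

    $encoded = '';

    // Split by character in the given encoding, so that the bytes of a
    // multibyte character are never examined individually.
    foreach (mb_str_split($string, 1, $encoding) as $char) {
        if (strlen($char) === 1 && ord($char) < 0x80) {
            // Single byte below 0x80: treated as ASCII and passed to
            // rawurlencode(), which leaves only A-Z a-z 0-9 - _ . ~ as-is.
            $encoded .= rawurlencode($char);
        } else {
            // Non-ASCII character: left untouched, as an IRI would allow.
            $encoded .= $char;
        }
    }

    return $encoded;
}
```

With Shift_JIS input, for example, the second byte of a two-byte character can fall in the ASCII range (0x5C in "\x95\x5C"). Splitting with the correct encoding keeps both bytes together, so the 0x5C byte is not percent-encoded on its own.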