---------- Forwarded message ---------
From: youkidearitai <youkideari...@gmail.com>
Date: Thu, Mar 20, 2025, 14:41
Subject: Re: [PHP-DEV] Potential RFC: mb_rawurlencode() ?
To: Paul M. Jones <pmjo...@pmjones.io>


On Wed, Mar 19, 2025 at 2:52, Paul M. Jones <pmjo...@pmjones.io> wrote:
>
> Hi all,
>
> The discussion around WHATWG-URL on this list, as well as my work 
> coordinating Uri-Interop <https://github.com/uri-interop/interface>, lead me 
> to think PHP needs a multibyte equivalent of rawurlencode().
>
> Broadly speaking, as far as I can tell:
>
> - For an RFC 3986 URI, delimiters need to be percent-encoded, as well as 
> non-ASCII characters.
> - For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS 
> characters do not.
>
> (There are other details but I think you get the idea.)
>
> The rawurlencode() function does fine for URIs, but not for IRIs. Using 
> rawurlencode() for an IRI will encode multibyte characters when it should 
> leave them alone. For example:
>
> ```
> $val = 'fü bar';
>
> $uriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true
>
> $iriPath = '/heads/' . rawurlencode($val) . '/tails/';
> assert($iriPath === '/heads/fü bar/tails/'); // false
> ```
>
> (This might apply to WHATWG-URL component construction as well.)
>
> Have I missed something, either in the specs or in PHP itself?
>
> If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland 
> implementation might look something like the code below.
>
> Thoughts?
>
> * * *
>
> ```php
> function mb_rawurlencode(string $string) : string
> {
>     $encoded = '';
>
>     foreach (mb_str_split($string) as $char) {
>         $encoded .= match ($char) {
>             chr(0) => "%00",
>             chr(1) => "%01",
>             chr(2) => "%02",
>             chr(3) => "%03",
>             chr(4) => "%04",
>             chr(5) => "%05",
>             chr(6) => "%06",
>             chr(7) => "%07",
>             chr(8) => "%08",
>             chr(9) => "%09",
>             chr(10) => "%0A",
>             chr(11) => "%0B",
>             chr(12) => "%0C",
>             chr(13) => "%0D",
>             chr(14) => "%0E",
>             chr(15) => "%0F",
>             chr(16) => "%10",
>             chr(17) => "%11",
>             chr(18) => "%12",
>             chr(19) => "%13",
>             chr(20) => "%14",
>             chr(21) => "%15",
>             chr(22) => "%16",
>             chr(23) => "%17",
>             chr(24) => "%18",
>             chr(25) => "%19",
>             chr(26) => "%1A",
>             chr(27) => "%1B",
>             chr(28) => "%1C",
>             chr(29) => "%1D",
>             chr(30) => "%1E",
>             chr(31) => "%1F",
>             chr(127) => "%7F",
>             "!" => '%21',
>             "#" => '%23',
>             "$" => '%24',
>             "%" => '%25',
>             "&" => '%26',
>             "'" => '%27',
>             "(" => '%28',
>             ")" => '%29',
>             "*" => '%2A',
>             "+" => '%2B',
>             "," => '%2C',
>             "/" => '%2F',
>             ":" => '%3A',
>             ";" => '%3B',
>             "=" => '%3D',
>             "?" => '%3F',
>             "[" => '%5B',
>             "]" => '%5D',
>             default => $char,
>         };
>     }
>
>     return $encoded;
> }
> ```
>
> * * *
>
>
> -- pmj

Hi, Paul.

I think the signature should be as follows:

```php
function mb_rawurlencode(string $string, string $encode): string {}
```

That is because mbstring functions also handle encodings other than
Unicode (ISO-8859-1 through ISO-8859-16, Shift_JIS, EUC-*, etc.).
Beyond that, I don't have an opinion yet.
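
To illustrate why the encoding matters, here is a rough userland sketch.
The function name mb_rawurlencode_sketch() and the nullable $encoding
parameter are only assumptions for illustration, not an existing mbstring
API. It escapes the same ASCII characters as the match() arms in Paul's
code, and it assumes an ASCII-compatible encoding (it would not work as-is
for UTF-16, for example).

```php
// Hypothetical sketch: the name and the $encoding parameter are assumptions.
function mb_rawurlencode_sketch(string $string, ?string $encoding = null): string
{
    // Fall back to the internal encoding, as other mbstring functions do.
    $encoding ??= mb_internal_encoding();

    // Same ASCII delimiters as the match() arms in the quoted implementation.
    $delimiters = '!#$%&\'()*+,/:;=?[]';

    $encoded = '';

    // Split per character in the given encoding, not per byte.
    foreach (mb_str_split($string, 1, $encoding) as $char) {
        // Only single-byte ASCII characters are candidates for escaping.
        if (strlen($char) === 1 && ord($char) < 0x80) {
            $byte = ord($char);
            $isControl = $byte < 0x20 || $byte === 0x7F;
            if ($isControl || strpos($delimiters, $char) !== false) {
                $encoded .= sprintf('%%%02X', $byte);
                continue;
            }
        }

        // Multibyte and other characters are left untouched.
        $encoded .= $char;
    }

    return $encoded;
}

// Usage: the slash is escaped, the multibyte "ü" is left alone.
echo mb_rawurlencode_sketch('fü bar/baz', 'UTF-8'), "\n";
```

In Shift_JIS, the second byte of some characters falls in the ASCII range
(for example it can be the same byte as "["), so a byte-wise approach would
corrupt them. Splitting with the right encoding avoids that.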


Oops, I forgot to send this to internals.
Sorry, resending it here.

Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
