Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Máté Kocsis Sun, 13 Apr 2025 05:12:43 -0700

Hi Tim,

> I think I would prefer:
>
>      namespace Uri {
>          class InvalidUriException extends \Uri\UriException
>          {
>          }
>      }
>
>      namespace Uri\WhatWg {
>          class InvalidUrlException extends \Uri\InvalidUriException {
>              /** @var list<UrlValidationError> */
>              public readonly array $errors;
>          }
>      }
>
> (note the use of Url in the name of the sub-exception)
>
> While this would result in a little more boilerplate, it would make
> static analysis tools more useful, since the `$errors` array could be
> properly typed instead of being just `array<mixed>`.
>


OK, this makes sense to me, and I've just implemented it.


> > 7.
> >>
> >> In the “Component retrieval” section: Please add even more examples of
> >> what kind of percent-decoding will happen. For example, it's important
> >> to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
> >> decoded to `=`. This really is the same case as with `%2F` in a path.
> >> The explanation
> >>
> > […]
> > The relevant sections will give a little more reasoning why I went with
> > these rules.
>
> I've tested some of the examples against the implementation, but it does
> not match the description. Is the implementation up to date?
>
>      <?php
>
>      $url = new Uri\WhatWg\Url("https://example.com/foo/bar%2Fbaz");
>
>      var_dump($url->getPath());     // /foo/bar%2Fbaz
>      var_dump($url->getRawPath());  // /foo/bar%2Fbaz
>
> results in:
>
>      string(12) "/foo/bar/baz"
>      string(14) "/foo/bar%2Fbaz"
>

Yes, it is currently up-to-date, but I made some changes in WHATWG encoding
not long ago and I didn't notice that
the chosen behavior negatively affects this case... Let me share the
details, because decoding of WHATWG
URLs seems very problematic.

Originally, my intention was to percent-decode characters based on the
individual components' "percent-encode set" (e.g.
https://url.spec.whatwg.org/#fragment-percent-encode-set for the fragment).
These are the characters that are automatically percent-encoded when
encountered. One of my problems with this behavior was that the characters
in the "percent-encode sets" are not entirely in line with the "URL code
points" (basically the valid characters in a URL:
https://url.spec.whatwg.org/#url-code-points). Most notably, the "#", "[",
and "]" characters are present in some percent-encode sets, while they are
missing from the valid URL code points.

If characters were percent-decoded based on the "percent-encode sets", then
there would be some issues when the result is passed to a wither: the
WHATWG setter algorithms emit a soft error in these cases (e.g. in case of
the query string, the https://url.spec.whatwg.org/#dom-url-search steps
trigger https://url.spec.whatwg.org/#query-state, where step 3.1 comes into
play). To be fair, soft errors are not exposed in case of WHATWG withers, so
this is currently more of a theoretical problem than an actual one (but I'm
still considering adding a `$softErrors` parameter to WHATWG withers).

In any case, I believe the end of the "Component modification" section of
the RFC shares some background information regarding the percent-decoding
behavior.

Finally, when I changed the RFC so that only those characters which are
"URL code points" are percent-decoded, I didn't notice that the example you
referred to above would become outdated: as "/" is a URL code point, it's
currently percent-decoded by getPath().
Unfortunately, I still don't know what the best approach would be.
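
To make the trade-off concrete, here is a minimal JavaScript sketch. It uses
the standard `URL` class only to obtain the raw path; `decodeExcept()` is a
hypothetical helper written for this illustration (it is not part of the
RFC or of any library), and it simplifies matters by only decoding
printable ASCII. It shows that decoding `%2F` changes the number of path
segments, while keeping delimiters encoded preserves them.

```javascript
// Hypothetical helper: percent-decode printable ASCII escapes, but leave
// any escape alone whose decoded character is listed in `keepEncoded`.
function decodeExcept(component, keepEncoded) {
  return component.replace(/%([0-9A-Fa-f]{2})/g, (escape, hex) => {
    const code = parseInt(hex, 16);
    // Simplification: controls, space and non-ASCII bytes stay encoded.
    if (code < 0x21 || code > 0x7e) return escape;
    const ch = String.fromCharCode(code);
    return keepEncoded.includes(ch) ? escape : ch;
  });
}

// The WHATWG parser itself leaves %2F untouched in the path.
const rawPath = new URL("https://example.com/foo/bar%2Fbaz").pathname;

const decodedAll = decodeExcept(rawPath, "");      // decodes "/" as well
const decodedSafe = decodeExcept(rawPath, "/?#%"); // keeps delimiters encoded

console.log(decodedAll, decodedAll.split("/").length - 1);   // /foo/bar/baz 3
console.log(decodedSafe, decodedSafe.split("/").length - 1); // /foo/bar%2Fbaz 2
```

Which characters belong in the "keep encoded" list is exactly the open
question above; the list used here is only an assumption for the demo.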


> Please also give an explicit example for `%3F` in a path. I know that it
> is reserved from reading the Rfc3986, but I think it's a little
> unintuitive. You can adjust the last example in the component retrieval
> section to make it show all cases. So:
>
>      $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
>
>      echo $uri->getHost();      // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>      echo $uri->getRawHost();   // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>      echo $uri->getPath();      // /foo/bar%3Fbaz
>      echo $uri->getRawPath();   // /foo/bar%3Fbaz
>      echo $uri->getQuery();     // foo=bar%26baz%3Dqux
>      echo $uri->getRawQuery();  // foo=bar%26baz%3Dqux
>

Why is this behavior unintuitive? I think the examples already in the RFC
should make it clear that percent-encoded characters are never
percent-decoded (the component modification part also has one example).


> During testing I also noticed that the Rfc3986 implementation removes
> trailing slashes from the path when using the normalized version. This
> was a little unexpected, because to me this is the difference between a
> directory and a file. I don't think there are clear examples showing
> that. So:
>
>      $uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");
>
>      echo $uri->getPath();     // /foo/bar
>      echo $uri->getRawPath();  // /foo/bar/
>

Yes, I agree it's weird. I'll have another look at the code to see whether
the normalizer removes the trailing slash, or whether I messed something up.
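
For comparison, the `remove_dot_segments` algorithm from RFC 3986 section
5.2.4 keeps a trailing slash. The following is a minimal JavaScript sketch
of that algorithm, written for this illustration only; the function name
`removeDotSegments` is my own, and this is not the code used by the
implementation under discussion.

```javascript
// Sketch of RFC 3986, section 5.2.4 (remove_dot_segments).
function removeDotSegments(path) {
  let input = path;
  const output = []; // each entry is one segment including its leading "/"
  while (input.length > 0) {
    if (input.startsWith("../")) {
      input = input.slice(3);            // step A
    } else if (input.startsWith("./")) {
      input = input.slice(2);            // step A
    } else if (input.startsWith("/./")) {
      input = input.slice(2);            // step B: "/./" becomes "/"
    } else if (input === "/.") {
      input = "/";                       // step B
    } else if (input.startsWith("/../")) {
      input = input.slice(3);            // step C: "/../" becomes "/"
      output.pop();
    } else if (input === "/..") {
      input = "/";                       // step C
      output.pop();
    } else if (input === "." || input === "..") {
      input = "";                        // step D
    } else {
      // Step E: move the first segment (with its leading "/", if any).
      const next = input.indexOf("/", 1);
      const segment = next === -1 ? input : input.slice(0, next);
      output.push(segment);
      input = input.slice(segment.length);
    }
  }
  return output.join("");
}

console.log(removeDotSegments("/foo/bar/"));        // /foo/bar/
console.log(removeDotSegments("/a/b/c/./../../g")); // /a/g
```

The final "/" is moved to the output as its own (empty) segment in step E,
which is why the directory form of the path survives.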


> >> In the “Component Modification” section, the RFC states that WhatWgUrl
> >> will automatically encode `?` and `#` as necessary. Will the same
> >> happen
> >> for Rfc3986? Will the encoding of `#` also happen for the query-string
> >> component? The RFC only mentions the path component.


I think the question for RFC 3986 is answered in the PHP RFC by the
following paragraph:

> In order to offer consistent behavior with the parsing rules of RFC 3986,
> withers of Uri\Rfc3986\Uri also only accept properly formatted input,
> meaning characters that are not allowed to be present in a component must
> be percent-encoded. Let's see what this means in practice through the
> following example

Effectively, RFC 3986 behaves differently from WHATWG here.

The latter question ("Will the encoding of `#` also happen for the
query-string component?")
was supposed to be answered by the RFC, because of this sentence:

> WHATWG algorithm automatically percent-encodes characters that fall into
> the percent-encoding character set of the given component

It's possible that "the given" part is misleading, but the behavior
actually follows the WHATWG spec
for all components. In any case, I'll change a few words to make this clear.
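
As a cross-check, the same setter algorithms can be observed through the
JavaScript `URL` class, which follows the WHATWG spec. This is only a
sketch of the spec behavior; that the PHP withers produce the same strings
is an assumption based on them following the same algorithms.

```javascript
const url = new URL("https://example.com/foo/path");

// "?" falls into the path percent-encode set, so the setter encodes it.
url.pathname = "some/path?foo=bar";

// "#" falls into the query percent-encode set, so it is encoded as well.
url.search = "a=b#c";

console.log(url.pathname); // /some/path%3Ffoo=bar
console.log(url.search);   // ?a=b%23c
console.log(url.href);     // https://example.com/some/path%3Ffoo=bar?a=b%23c
```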

> Is the implementation already up to date with this change? When I try:
>
>      var_dump(
>         (new Uri\Rfc3986\Uri('https://example.com/foo/path'))
>                 ->withPath('some/path?foo=bar')
>                 ->toString()
>      );
>
> I get
>
>      string(36) "https://example.comsome/path?foo=bar"
>
> which is completely wrong.
>

I haven't completely implemented withers yet for RFC 3986 (first and
foremost, validation is missing), so that's why you experienced this
behavior. I'll fix this later, but only if the vote succeeds. I've already
worked a lot on the implementation without any guarantee that the RFC will
succeed.


> I think this might be a misunderstanding of the WHATWG specification. It
> seems to be also normalized during parsing:
>
> When I do the following in my Google Chrome:
>
>      (new URL('https://[0:0::1]')).host;
>
> I get `[::1]`, which indicates the normalization happening. And likewise
> will:
>
>      (new URL('https://[2001:db8:0:0:0:0:0:1]')).host;
>
> result in `[2001:db8::1]`.
>

Yes, I realized that you are right. IPv6 support was indeed incomplete or
buggy until now, but I took some time and corrected the behavior.


> My expectation would be `[2001:db8:0:0:0:0:0:1]` for Rfc3986 and
> `[2001:db8::1]` for WhatWg. I have also tested the behavior of
> `withHost()` when leaving out the square brackets. The Rfc3986 correctly
> throws an Exception, but WhatWg silently does nothing:
>
>      $url = 'https://example.com/foo/path';
>
>      var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());
>
> results in
>
>      string(28) "https://example.com/foo/path"
>

This looks like the result of WHATWG's host setter algorithm (
https://url.spec.whatwg.org/#dom-url-hostname).
After debugging the behavior, I noticed that
`new Uri\WhatWg\Url('2001:db8:0:0:0:0:0:1')` only fails when trying to parse
the port after the first ":" character. However, the setter algorithm
obviously doesn't reach this point, since it only tries to parse the host,
and then it stops (because of the state override). So I'm not sure this
gotcha can be fixed.

I tried to reproduce the problem in Chrome, but I realized that the URL
properties are not validated at all when they are set
(`url.hostname = "2001:db8:0:0:0:0:0:1";` changes the hostname without any
problem)...

Regards,
Máté
