Hi Tim,

> I think I would prefer:
>
>     namespace Uri {
>         class InvalidUriException extends \Uri\UriException
>         {
>         }
>     }
>
>     namespace Uri\WhatWg {
>         class InvalidUrlException extends \Uri\InvalidUriException {
>             /** @var list<UrlValidationError> */
>             public readonly array $errors;
>         }
>     }
>
> (note the use of Url in the name of the sub-exception)
>
> While this would result in a little more boilerplate, it would make
> static analysis tools more useful, since the `$errors` array could be
> properly typed instead of being just `array<mixed>`.

OK, this makes sense to me, and I've just implemented it.

> > 7.
> >>
> >> In the “Component retrieval” section: Please add even more examples of
> >> what kind of percent-decoding will happen. For example, it's important
> >> to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
> >> decoded to `=`. This really is the same case as with `%2F` in a path.
> >> The explanation
> >>
> > […]
> > The relevant sections will give a little more reasoning why I went with
> > these rules.
>
> I've tested some of the examples against the implementation, but it does
> not match the description. Is the implementation up to date?
>
>     <?php
>
>     $url = new Uri\WhatWg\Url("https://example.com/foo/bar%2Fbaz");
>
>     var_dump($url->getPath());    // /foo/bar%2Fbaz
>     var_dump($url->getRawPath()); // /foo/bar%2Fbaz
>
> results in:
>
>     string(12) "/foo/bar/baz"
>     string(14) "/foo/bar%2Fbaz"

Yes, it is currently up to date, but I changed the WHATWG encoding behavior not long ago, and I didn't notice that the chosen behavior negatively affects this case. Let me share the details, because decoding WHATWG URLs seems very problematic.

Originally, my intention was to percent-decode characters based on the individual components' "percent-encode set" (e.g. https://url.spec.whatwg.org/#fragment-percent-encode-set for the fragment). These are the characters that are automatically percent-encoded when encountered. One of my problems with this behavior was that the characters in the "percent-encode sets" are not entirely in line with the "URL code points" (basically the valid characters in a URL: https://url.spec.whatwg.org/#url-code-points). Most notably, the "#", "[", and "]" characters are present in some percent-encode sets, while they are missing from the valid URL code points.

If characters were percent-decoded based on the "percent-encode sets", there would be some issues when the result is passed to a wither: the WHATWG setter algorithms emit a soft error in these cases (e.g. in case of the query string, the https://url.spec.whatwg.org/#dom-url-search steps trigger https://url.spec.whatwg.org/#query-state, where step 3.1 comes into play). To be fair, soft errors are not exposed by the WHATWG withers, so this is currently a theoretical problem rather than an actual one (but I'm still considering adding a `$softErrors` parameter to the WHATWG withers). In any case, I believe the end of the "Component modification" section of the RFC shares some background information regarding the percent-decoding behavior.

Finally, when I changed the RFC so that only characters that are "URL code points" are percent-decoded, I didn't notice that the example you referred to above would become outdated: since "/" is a URL code point, getPath() currently percent-decodes it. Unfortunately, I still don't know what the best approach would be.

> Please also give an explicit example for `%3F` in a path. I know that it
> is reserved from reading the Rfc3986, but I think it's a little
> unintuitive. You can adjust the last example in the component retrieval
> section to make it show all cases. So:
>
>     $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");
>
>     echo $uri->getHost();     // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>     echo $uri->getRawHost();  // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
>     echo $uri->getPath();     // /foo/bar%3Fbaz
>     echo $uri->getRawPath();  // /foo/bar%3Fbaz
>     echo $uri->getQuery();    // foo=bar%26baz%3Dqux
>     echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux

Why is this behavior unintuitive? I think the examples that are already in the RFC make it clear that percent-encoded characters are never percent-decoded (the component modification part also has one example).

> During testing I also noticed that the Rfc3986 implementation removes
> trailing slashes from the path when using the normalized version.
> This was a little unexpected, because to me this is the difference
> between a directory and a file. I don't think there are clear examples
> showing that. So:
>
>     $uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");
>
>     echo $uri->getPath();    // /foo/bar
>     echo $uri->getRawPath(); // /foo/bar/

Yes, I agree it's weird. I'll look at the code again to see whether the normalizer removes the trailing slash or whether I messed something up.

> >> In the “Component Modification” section, the RFC states that WhatWgUrl
> >> will automatically encode `?` and `#` as necessary. Will the same
> >> happen for Rfc3986? Will the encoding of `#` also happen for the
> >> query-string component? The RFC only mentions the path component.
> >
> > I think the question for RFC 3986 is answered in the PHP RFC by the
> > following paragraph:
> >
> > > In order to offer consistent behavior with the parsing rules of RFC 3986,
> > > withers of Uri\Rfc3986\Uri also only accept properly formatted input,
> > > meaning characters that are not allowed to be present in a component
> > > must be percent-encoded. Let's see what this means in practice through
> > > the following example
> >
> > Effectively, RFC 3986 has different behavior than what WHATWG does.
> >
> > The latter question ("Will the encoding of `#` also happen for the
> > query-string component?") was supposed to be answered by the RFC,
> > because of this sentence:
> >
> > > WHATWG algorithm automatically percent-encodes characters that fall
> > > into the percent-encoding character set of the given component
> >
> > It may be possible that "the given" part is misleading, but the behavior
> > actually follows the WHATWG spec for all components. In any case, I
> > change a few words to make this clear.
>
> Is the implementation already up to date with this change? When I try:
>
>     var_dump(
>         (new Uri\Rfc3986\Uri('https://example.com/foo/path'))
>             ->withPath('some/path?foo=bar')
>             ->toString()
>     );
>
> I get
>
>     string(36) "https://example.comsome/path?foo=bar"
>
> which is completely wrong.

I haven't fully implemented the withers for RFC 3986 yet (first and foremost, validation is missing), which is why you saw this behavior. I would fix this later, but only if the vote succeeds: I've already put a lot of work into the implementation without any guarantee that the RFC will pass.

> I think this might be a misunderstanding of the WHATWG specification. It
> seems to be also normalized during parsing:
>
> When I do the following in my Google Chrome:
>
>     (new URL('https://[0:0::1]')).host;
>
> I get `[::1]`, which indicates the normalization happening. And likewise
> will:
>
>     (new URL('https://[2001:db8:0:0:0:0:0:1]')).host;
>
> result in `[2001:db8::1]`.

Yes, I realized that you are right. IPv6 support was indeed incomplete or buggy until now, but I took some time and corrected the behavior.

> My expectation would be `[2001:db8:0:0:0:0:0:1]` for Rfc3986 and
> `[2001:db8::1]` for WhatWg. I have also tested the behavior of
> `withHost()` when leaving out the square brackets. The Rfc3986 correctly
> throws an Exception, but WhatWg silently does nothing:
>
>     $url = 'https://example.com/foo/path';
>
>     var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());
>
> results in
>
>     string(28) "https://example.com/foo/path"

This looks like the result of WHATWG's host setter algorithm (https://url.spec.whatwg.org/#dom-url-hostname). After debugging the behavior, I noticed that "new Uri\WhatWg\Url('2001:db8:0:0:0:0:0:1')" only fails when trying to parse the port after the first ":" character. However, the setter algorithm obviously doesn't reach this point, since it only tries to parse the host and then stops (because of the state override). So I'm not sure this gotcha can be cured.

I tried to reproduce the problem in Chrome, but I realized that the URL properties are not validated at all when they are set ("url.hostname = "2001:db8:0:0:0:0:0:1";" changes the hostname without any problem)...

Regards,
Máté
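
For reference, the WHATWG behaviors discussed in this thread can be checked against the standard `URL` class in a browser console or in Node.js. The snippet below is only a minimal sketch for comparison, not part of the RFC or its implementation; the `describe` helper and the variable names are made up for illustration. It shows IPv6 compression during parsing, that `%2F` in a path is left encoded, and that the path and query setters percent-encode `?` and `#`.

```javascript
// Sketch using only the standard WHATWG URL class (a global in browsers and Node.js).
// `describe` is a hypothetical helper name, not an API of any library.
function describe(input) {
  const url = new URL(input);
  return { host: url.host, pathname: url.pathname, search: url.search };
}

// IPv6 hosts are compressed while parsing.
const compressed = describe("https://[2001:db8:0:0:0:0:0:1]/");
console.log(compressed.host); // [2001:db8::1]

// A percent-encoded "/" in the path is not decoded.
const encodedSlash = describe("https://example.com/foo/bar%2Fbaz");
console.log(encodedSlash.pathname); // /foo/bar%2Fbaz

// Setters percent-encode characters from the component's percent-encode set.
const modified = new URL("https://example.com/foo/path");
modified.pathname = "some/path?foo=bar"; // "?" is in the path percent-encode set
modified.search = "a#b";                 // "#" is in the query percent-encode set
console.log(modified.pathname); // /some/path%3Ffoo=bar
console.log(modified.search);   // ?a%23b
```

The setter results are the counterpart of the automatic encoding that the RFC describes for the WHATWG withers; the RFC 3986 withers are specified to reject such input instead.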