It seems that I’ve mucked up the mailing list again by deleting an old message
I intended to reply to. Apologies all around for replying to an older message
of my own.
Máté, thanks for your continued work on the URL parsing RFC. I’m only now
returning from a bit of an extended leave, so I appreciate your diligence and
patience. Here are some thoughts in response to your message from Nov. 19, 2024.
> even though the majority does, not everyone builds a browser application
> with PHP, especially because URIs are not necessarily accessible on the web
This has largely been touched on elsewhere, but I will echo the idea that it
seems valid to have two separate parsers for the two standards; truly they
diverge enough that sharing an interface between them could only ever be a
superficial thing.
I only harp on the WhatWG spec so much because for many people this will be the
only one they are aware of, if they are aware of any spec at all, and this is a
sizable vector of attack targeting servers from user-supplied content. I’m
curious to hear from folks here what fraction of actual PHP code deals with
RFC3986 URLs, and of those, whether the systems using them are truly RFC3986
systems or whether the common-enough URLs are valid in both specs.
Just to enlighten me and possibly others with less familiarity, how and when
are RFC3986 URLs used and what are those systems supposed to do when an invalid
URL appears, such as when dealing with percent-encodings as you brought up in
response to Tim?
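To give one concrete case of what I mean, here’s a stray percent sign (using
the parse() method from the RFC purely for illustration; correct me if I have
the behavior wrong):
```
// "%ZZ" is not a valid percent-encoding. A WhatWG parser records a
// validation error but carries on, leaving the bytes in the path as-is,
// while RFC 3986's grammar simply doesn't admit the input at all.
$url = Uri\WhatWgUri::parse( 'https://example.com/%ZZ' );

// What should a strict RFC3986 system do here: reject the URL outright,
// re-encode the "%" as "%25", or pass the problem along to the caller?
```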
Coming from the XHTML/HTML/XML side I know that there was substantial effort to
enforce standards on browsers and that led to decades of security exploits and
confusion, when the “official” standards never fully existed in the way people
thought. I don’t mean to start any flame wars, but is the URL story at all
similar here?
I’m mostly worried that we could accidentally encourage risky behavior from
developers who aren’t familiar with the nuances of having two URL specifications
vs. having the simplest, least-specific interface point them in the right
direction for what they will probably be doing. `parse_url()` is a great
example of how the thing that looks _right_ is actually terribly prone to
failure.
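To make that concrete, here’s the classic sort of confusion I have in mind
(the exact behavior is from memory, so treat the output as illustrative):
```
// PHP treats the backslash as just another userinfo character, so the
// last "@" wins and the reported host is the attacker-controlled part.
$parts = parse_url( 'https://example.com\\@evil.com/' );
var_dump( $parts['host'] ); // "evil.com"

// A WhatWG-following browser treats "\" like "/" for special schemes,
// so the same string navigates to example.com with path "/@evil.com/".
```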
> The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the
> 2nd (base URI) parameter is provided. So essentially you need to use
> this variant of the parse() method if you want to parse a WhatWg compliant
> URL
If this means passing something like the following then I suppose it’s okay. It
would be nice to be able to know without passing the second parameter, as there
are a multitude of cases where no such base URL would be available, and some
dummy parameter would need to be provided.
```
$url = Uri\WhatWgUri::parse( $url, 'https://example.com' );
var_dump( $url->is_relative_or_something_like_that );
```
This would be fine, knowing in hindsight that it was originally a relative
path. Of course, this would mean that it’s critical that `https://example.com`
does not replace the actual host part if one is provided in `$url`. For
example, this code should work.
```
$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net';
```
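And conversely, a genuinely relative reference should pick up its missing
pieces from the base (again using my made-up property name from above):
```
// The relative reference inherits scheme and host from the base URL.
$url = Uri\WhatWgUri::parse( '/rfc', 'https://example.com' );
$url->domain === 'example.com'; // and the path should be '/rfc'
```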
> The forDisplay() method also seems to be useful at the first glance, but
> since this may be a controversial optional feature, I'd defer it for later…
Hopefully this won’t be too controversial, even though the concept was new to
me when I started having to reliably work with URLs. I chose the example I did
because of human risk factors in security exploits. "xn--google.com" is not in
fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com".
This is a misleading URL to human readers, which is why the WhatWG indicates
that “browsers should render a URL’s host by running domain to Unicode with the
URL’s host and false” [https://url.spec.whatwg.org/#url-rendering-i18n].
The lack of a standard method here means that (a) most code won’t render the
URLs the way a human would recognize them, and (b) those who do will resort to
inefficient and likely-incomplete user-space code to try to decode/render
these hosts.
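For reference, that user-space code today looks something like the following
sketch, which already assumes ext/intl is installed (it frequently isn’t):
```
// Rough sketch of today's user-space fallback: pull the host out of the
// raw URL and hope ext/intl is loaded so idn_to_utf8() can decode it.
$raw  = 'https://xn--google.com/';
$host = parse_url( $raw, PHP_URL_HOST );

if ( is_string( $host ) && function_exists( 'idn_to_utf8' ) ) {
    // idn_to_utf8() returns false on failure; keep the original then.
    $display_host = idn_to_utf8( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 ) ?: $host;
} else {
    $display_host = $host; // no intl extension: stuck with the Punycode form
}
```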
This may be fine to leave for a follow-up to this work, but it’s also something
I personally consider essential for any native support of handling URLs that
are destined for human review. If sending to an `href` attribute it should be
the normalized URL; but if displayed as text it should be easy to avoid
tricking people in this way.
In my HTML decoding RFC I tried to bake this decision into the type of the
function using an enum. Since I figured most people are unaware of the role of
the context in which HTML text is decoded, I found the enum to be a suitable
convenience as well as an educational tool.
```
$url->toString( Uri\WhatWg\RenderContext::ForHumans ); // 䕮䕵䕶䕱.com
$url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com
```
The names probably are terrible in all of my code snippets, but at this point
I’m not proposing actual names, just code samples good enough to illustrate the
point. By forcing a choice here (no default value) someone will see the options
and probably make the right call.
----
This is all looking quite nice. I’m happy to see how the RFC continues to
develop, and I’m eagerly looking forward to being able to finally rely on PHP’s
handling of URLs.
Happy new year,
Dennis Snell