On 03/01/2025 08:18, Dennis Snell wrote:
It seems that I’ve mucked up the mailing list again by deleting an old message
I intended to reply to. Apologies all around for replying to an older
message of my own.
Máté, thanks for your continued work on the URL parsing RFC. I’m only now
returning from a bit of an extended leave, so I appreciate your diligence and
patience. Here are some thoughts in response to your message from Nov. 19, 2024.
> even though the majority does, not everyone builds a browser application
> with PHP, especially because URIs are not necessarily accessible on the web
This has largely been touched on elsewhere, but I will echo the idea that it
seems valid to have two separate parsers for the two standards; they truly
diverge enough that sharing an interface could only ever be a superficial
thing.
I only harp on the WhatWG spec so much because for many people this will be the
only one they are aware of, if they are aware of any spec at all, and this is a
sizable vector of attack on servers from user-supplied content. I’m
curious to hear from folks here what fraction of the actual PHP code deals with
RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems
or if the common-enough URLs are valid in both specs.
Just to enlighten me and possibly others with less familiarity, how and when
are RFC3986 URLs used and what are those systems supposed to do when an invalid
URL appears, such as when dealing with percent-encodings as you brought up in
response to Tim?
Coming from the XHTML/HTML/XML side I know that there was substantial effort to
enforce standards on browsers and that led to decades of security exploits and
confusion, when the “official” standards never fully existed in the way people
thought. I don’t mean to start any flame wars, but is the URL story at all
similar here?
I’m mostly worried that we could accidentally encourage risky behavior from
developers who aren’t familiar with the nuances of having two URL specifications,
vs. having the simplest, least-specific interface point them in the right
direction for what they will probably be doing. `parse_url()` is a great
example of how the thing that looks _right_ is actually terribly prone to
failure.
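To make that concrete, here is a sketch of the kind of divergence I mean,
assuming a typical PHP build (the exact components reported can vary by
version, and the host names are only illustrative):

```
$parts = parse_url( 'http://evil.example\\@good.example/' );

// On common PHP builds this reports 'good.example' as the host and
// 'evil.example\' as the user, because parse_url() splits the userinfo at
// the last '@' and gives no special meaning to '\'. A WhatWG-conforming
// parser treats '\' like '/' for http(s), so its host would be
// 'evil.example' and '/@good.example/' would become the path.
var_dump( $parts );
```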
> The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the
> 2nd (base URI) parameter is provided. So essentially you need to use
> this variant of the parse() method if you want to parse a WhatWg compliant
> URL
If this means passing something like the following then I suppose it’s okay. It
would be nice to be able to know without passing the second parameter, as there
are a multitude of cases where no such base URL would be available, and some
dummy parameter would need to be provided.
```
$url = Uri\WhatWgUri::parse( $url, 'https://example.com' );
var_dump( $url->is_relative_or_something_like_that );
```
This would be fine, knowing in hindsight that it was originally a relative
path. Of course, this would mean that it’s critical that `https://example.com`
does not replace the actual host part if one is provided in `$url`. For
example, this code should work.
```
$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net'
```
> The forDisplay() method also seems to be useful at the first glance, but since
> this may be a controversial optional feature, I'd defer it for later…
Hopefully this won’t be too controversial, even though the concept was new to me when I
started having to reliably work with URLs. I chose the example I did because of human risk
factors in security exploits. "xn--google.com" is not in fact a Google domain, but
an IDNA domain decoding to "䕮䕵䕶䕱.com".
This is a misleading URL to human readers, which is why the WhatWG indicates
that “browsers should render a URL’s host by running domain to Unicode with the
URL’s host and false.” [https://url.spec.whatwg.org/#url-rendering-i18n].
The lack of a standard method here means that (a) most code won’t render the
URLs the way a human would recognize them, and (b) those who do will resort to
inefficient and likely-incomplete user-space code to try and decode/render
these hosts.
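For context, here is roughly what that user-space fallback tends to look like
today, assuming ext/intl is available; note that idn_to_utf8() applies UTS #46
processing, which approximates but does not exactly match the WhatWG “domain to
Unicode” operation:

```
// Sketch only: assumes ext/intl; UTS #46 is close to, but not identical to,
// the WhatWG "domain to Unicode" algorithm.
function render_host_for_humans( string $asciiHost ): string {
    $unicode = idn_to_utf8(
        $asciiHost,
        IDNA_NONTRANSITIONAL_TO_UNICODE,
        INTL_IDNA_VARIANT_UTS46,
        $info
    );

    // Fall back to the ASCII serialization if decoding fails or reports errors.
    if ( false === $unicode || ! empty( $info['errors'] ) ) {
        return $asciiHost;
    }

    return $unicode;
}
```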
It may be something fine for a follow-up to this work, but it’s also something
I personally consider essential for any native support of handling URLs that
are destined for human review. If sending to an `href` attribute it should be
the normalized URL; but if displayed as text it should be easy to prevent
tricking people in this way.
In my HTML decoding RFC I tried to bake this decision into the type of the
function using an enum. Since I figured most people are unaware of the role of
the context in which HTML text is decoded, I found the enum to be a suitable
convenience as well as an educational tool.
```
$url->toString( Uri\WhatWg\RenderContext::ForHumans ); // 䕮䕵䕶䕱.com
$url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com
```
The names probably are terrible in all of my code snippets, but at this point
I’m not proposing actual names, just code samples good enough to illustrate the
point. By forcing a choice here (no default value) someone will see the options
and probably make the right call.
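To make that forced choice concrete, here is a minimal sketch of the shape I
have in mind; the names and types are purely illustrative, not a proposal:

```
// Illustrative only: names and shape are not part of the RFC.
namespace Uri\WhatWg;

enum RenderContext
{
    case ForHumans;   // host rendered via "domain to Unicode"
    case ForMachines; // host left in its ASCII/Punycode serialization
}

interface RendersUrls
{
    // No default value: callers must consciously pick a context.
    public function toString( RenderContext $context ): string;
}
```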
----
This is all looking quite nice. I’m happy to see how the RFC continues to
develop, and I’m eagerly looking forward to being able to finally rely on PHP’s
handling of URLs.
Happy new year,
Dennis Snell
Hi Dennis,
> I’m curious to hear from folks here what fraction of the actual PHP
> code deals with RFC3986 URLs, and of those, if the systems using them
> are truly RFC3986 systems or if the common-enough URLs are valid in both
> specs.
Here's my take on both RFCs. RFC3986/87 is a "parsing" RFC which leaves
validation to each individual scheme. For instance, the following URL
is valid under RFC3986 but will be problematic under the WHATWG URL spec:
```
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
```
The LDAP URL is RFC3986 compliant but adds its own validation rules on
top of the RFC. This means that LDAP URL generation would be problematic
if we only implemented the WHATWG spec, which is why having an RFC3986/87 URI
implementation in PHP is crucial.
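As a rough illustration (not part of the proposal), this is approximately how
the RFC3986-style component split of that URL looks with today's parse_url();
everything after the first "?" lands in the query component, double question
mark and all:

```php
// Approximate RFC3986-style component split of the LDAP URL above.
var_dump( parse_url(
    'ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)'
) );
// Expect roughly: scheme "ldap", host "ldap1.example.net", port 6666,
// path "/o=University%20of%20Michigan,c=US",
// query "?sub?(cn=Babs%20Jensen)".
```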
Furthermore, the WHATWG spec not only parses but at the same time
validates and more aggressively normalizes the URL, something RFC3986
does not do; or, more precisely, RFC3986 recognizes and separates
normalizations into two categories, the non-destructive and the destructive ones.
These normalizations affect the scheme, the path and also the host, which
can be very impactful in your application.
```php
For the following URL 'https://0073.0232.0311.0377/b'
RFC3986: 'https://0073.0232.0311.0377/b'
WHATWG URL: 'https://59.154.201.255/b'
```
So this can be a source of confusion for developers. Last but not least,
RFC3986 alone will never be able to parse IDN domain names and requires the
support of RFC3987 IDN domains to do so.
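A small illustration of that gap, assuming ext/intl is available: an IRI host
(RFC3987) has to be mapped to its ASCII form before an RFC3986-only consumer
can work with it.

```php
// Sketch only: assumes ext/intl. An RFC3987 (IRI) host must be converted to
// its ASCII (Punycode) form before an RFC3986-only parser can handle it.
$ascii = idn_to_ascii(
    'bücher.example',
    IDNA_NONTRANSITIONAL_TO_ASCII,
    INTL_IDNA_VARIANT_UTS46
);

var_dump( $ascii ); // roughly: string "xn--bcher-kva.example"
```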
Hopefully with those examples you will understand the strengths and
weaknesses of each spec and why, IMHO, PHP needs both to be up to date.