It seems that I’ve mucked up the mailing list again by deleting an old message
I intended to reply to. Apologies all around for replying to an older message
of my own.
Máté, thanks for your continued work on the URL parsing RFC. I’m only now
returning from a bit of an extended leave, so I appreciate your diligence and
patience. Here are some thoughts in response to your message from Nov. 19, 2024.
> even though the majority does, not everyone builds a browser application
> with PHP, especially because URIs are not necessarily accessible on the web
This has largely been touched on elsewhere, but I will echo the idea that it
seems valid to have two separate parsers for the two standards; truly they
diverge enough that sharing an interface between them could only ever be a
superficial thing.
I only harp on the WhatWG spec so much because for many people this will be the
only one they are aware of, if they are aware of any spec at all, and this is a
sizable vector of attack targeting servers from user-supplied content. I’m
curious to hear from folks here what fraction of actual PHP code deals with
RFC3986 URLs, and of those, whether the systems using them are truly RFC3986
systems or whether the common-enough URLs are valid in both specs.
Just to enlighten me and possibly others with less familiarity, how and when
are RFC3986 URLs used and what are those systems supposed to do when an invalid
URL appears, such as when dealing with percent-encodings as you brought up in
response to Tim?
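To give one concrete case of what I mean, here’s a stray percent sign (using
the parse() method from the RFC purely for illustration; correct me if I have
the behavior wrong):
```
// "%ZZ" is not a valid percent-encoding. A WhatWG parser records a
// validation error but carries on, leaving the bytes in the path as-is,
// while RFC 3986's grammar simply doesn't admit the input at all.
$url = Uri\WhatWgUri::parse( 'https://example.com/%ZZ' );

// What should a strict RFC3986 system do here: reject the URL outright,
// re-encode the "%" as "%25", or pass the problem along to the caller?
```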
Coming from the XHTML/HTML/XML side I know that there was substantial effort to
enforce standards on browsers and that led to decades of security exploits and
confusion, when the “official” standards never fully existed in the way people
thought. I don’t mean to start any flame wars, but is the URL story at all
similar here?
I’m mostly worried that we could accidentally encourage risky behavior from
developers who aren’t familiar with the nuances of having two URL specifications
vs. having the simplest, least-specific interface point them in the right
direction for what they will probably be doing. `parse_url()` is a great
example of how the thing that looks _right_ is actually terribly prone to
failure.
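To make that concrete, here’s the classic sort of confusion I have in mind
(the exact behavior is from memory, so treat the output as illustrative):
```
// PHP treats the backslash as just another userinfo character, so the
// last "@" wins and the reported host is the attacker-controlled part.
$parts = parse_url( 'https://example.com\\@evil.com/' );
var_dump( $parts['host'] ); // "evil.com"

// A WhatWG-following browser treats "\" like "/" for special schemes,
// so the same string navigates to example.com with path "/@evil.com/".
```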
> The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the
> 2nd (base URI) parameter is provided. So essentially you need to use
> this variant of the parse() method if you want to parse a WhatWg compliant
> URL
If this means passing something like the following then I suppose it’s okay. It
would be nice to be able to know without passing the second parameter, as there
are a multitude of cases where no such base URL would be available, and some
dummy parameter would need to be provided.
```
$url = Uri\WhatWgUri::parse( $url, 'https://example.com' );
var_dump( $url->is_relative_or_something_like_that );
```
This would be fine, knowing in hindsight that it was originally a relative
path. Of course, this would mean that it’s critical that `https://example.com`
does not replace the actual host part if one is provided in `$url`. For
example, this code should work.
```
$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net';
```
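And conversely, a genuinely relative reference should pick up its missing
pieces from the base (again using my made-up property name from above):
```
// The relative reference inherits scheme and host from the base URL.
$url = Uri\WhatWgUri::parse( '/rfc', 'https://example.com' );
$url->domain === 'example.com'; // and the path should be '/rfc'
```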
> The forDisplay() method also seems to be useful at the first glance, but
> since this may be a controversial optional feature, I'd defer it for later…
Hopefully this won’t be too controversial, even though the concept was new to
me when I started having to reliably work with URLs. I chose the example I did
because of human risk factors in security exploits. "xn--google.com" is not in
fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com".
This is a misleading URL to human readers, which is why the WhatWG indicates
that “browsers should render a URL’s host by running domain to Unicode with the
URL’s host and false” [https://url.spec.whatwg.org/#url-rendering-i18n].
The lack of a standard method here means that (a) most code won’t render the
URLs the way a human would recognize them, and (b) those who do will resort to
inefficient and likely-incomplete user-space code to try to decode/render
these hosts.
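For reference, that user-space code today looks something like the following
sketch, which already assumes ext/intl is installed (it frequently isn’t):
```
// Rough sketch of today's user-space fallback: pull the host out of the
// raw URL and hope ext/intl is loaded so idn_to_utf8() can decode it.
$raw  = 'https://xn--google.com/';
$host = parse_url( $raw, PHP_URL_HOST );

if ( is_string( $host ) && function_exists( 'idn_to_utf8' ) ) {
    // idn_to_utf8() returns false on failure; keep the original then.
    $display_host = idn_to_utf8( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 ) ?: $host;
} else {
    $display_host = $host; // no intl extension: stuck with the Punycode form
}
```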
This may be fine to leave for a follow-up to this work, but it’s also something
I personally consider essential for any native support of handling URLs that
are destined for human review. If sending to an `href` attribute it should be
the normalized URL; but if displayed as text it should be easy to avoid
tricking people in this way.
In my HTML decoding RFC I tried to bake this decision into the type of the
function using an enum. Since I figured most people are unaware of the role of
the context in which HTML text is decoded, I found the enum to be a suitable
convenience as well as an educational tool.
```
$url->toString( Uri\WhatWg\RenderContext::ForHumans ); // 䕮䕵䕶䕱.com
$url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com
```
The names probably are terrible in all of my code snippets, but at this point
I’m not proposing actual names, just code samples good enough to illustrate the
point. By forcing a choice here (no default value) someone will see the options
and probably make the right call.
----
This is all looking quite nice. I’m happy to see how the RFC continues to
develop, and I’m eagerly looking forward to being able to finally rely on PHP’s
handling of URLs.
Happy new year,
Dennis Snell