On 03/01/2025 08:18, Dennis Snell wrote:
It seems that I’ve mucked up the mailing list again by deleting an old message
I intended to reply to. Apologies all around for replying to an older
message of my own.
Máté, thanks for your continued work on the URL parsing RFC. I’m only now
returning from a bit of an extended leave, so I appreciate your diligence and
patience. Here are some thoughts in response to your message from Nov. 19, 2024.
> even though the majority does, not everyone builds a browser application
> with PHP, especially because URIs are not necessarily accessible on the web
This has largely been touched on elsewhere, but I will echo the idea that it
seems valid to have two separate parsers for the two standards; they truly
diverge enough that sharing an interface could only ever be a superficial
thing.
I only harp on the WhatWG spec so much because for many people this will be the
only one they are aware of, if they are aware of any spec at all, and this is a
sizable vector of attack on servers from user-supplied content. I’m
curious to hear from folks here what fraction of the actual PHP code deals with
RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems
or if the common-enough URLs are valid in both specs.
Just to enlighten me and possibly others with less familiarity, how and when
are RFC3986 URLs used and what are those systems supposed to do when an invalid
URL appears, such as when dealing with percent-encodings as you brought up in
response to Tim?
Coming from the XHTML/HTML/XML side I know that there was substantial effort to
enforce standards on browsers and that led to decades of security exploits and
confusion, when the “official” standards never fully existed in the way people
thought. I don’t mean to start any flame wars, but is the URL story at all
similar here?
I’m mostly worried that we could accidentally encourage risky behavior from
developers who aren’t familiar with the nuances of having two URL specifications,
vs. having the simplest, least-specific interface point them in the right
direction for what they will probably be doing. `parse_url()` is a great
example of how the thing that looks _right_ is actually terribly prone to
failure.
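To make that concrete, here is a sketch of the kind of divergence I mean,
assuming a typical PHP build (the exact components reported can vary by
version, and the host names are only illustrative):

```
$parts = parse_url( 'http://evil.example\\@good.example/' );

// On common PHP builds this reports 'good.example' as the host and
// 'evil.example\' as the user, because parse_url() splits the userinfo at
// the last '@' and gives no special meaning to '\'. A WhatWG-conforming
// parser treats '\' like '/' for http(s), so its host would be
// 'evil.example' and '/@good.example/' would become the path.
var_dump( $parts );
```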
> The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the
> 2nd (base URI) parameter is provided. So essentially you need to use
> this variant of the parse() method if you want to parse a WhatWg compliant
> URL
If this means passing something like the following then I suppose it’s okay. It
would be nice to be able to know without passing the second parameter, as there
are a multitude of cases where no such base URL would be available, and some
dummy parameter would need to be provided.
```
$url = Uri\WhatWgUri::parse( $url, 'https://example.com' );
var_dump( $url->is_relative_or_something_like_that );
```
This would be fine, knowing in hindsight that it was originally a relative
path. Of course, this would mean that it’s critical that `https://example.com`
does not replace the actual host part if one is provided in `$url`. For
example, this code should work.
```
$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net'
```
> The forDisplay() method also seems to be useful at the first glance, but since
> this may be a controversial optional feature, I'd defer it for later…
Hopefully this won’t be too controversial, even though the concept was new to me when I
started having to reliably work with URLs. I chose the example I did because of human risk
factors in security exploits. "xn--google.com" is not in fact a Google domain, but
an IDNA domain decoding to "䕮䕵䕶䕱.com".
This is a misleading URL to human readers, which is why the WhatWG indicates
that “browsers should render a URL’s host by running domain to Unicode with the
URL’s host and false.” [https://url.spec.whatwg.org/#url-rendering-i18n].
The lack of a standard method here means that (a) most code won’t render the
URLs the way a human would recognize them, and (b) those who do will resort to
inefficient and likely-incomplete user-space code to try and decode/render
these hosts.
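For context, here is roughly what that user-space fallback tends to look like
today, assuming ext/intl is available; note that idn_to_utf8() applies UTS #46
processing, which approximates but does not exactly match the WhatWG “domain to
Unicode” operation:

```
// Sketch only: assumes ext/intl; UTS #46 is close to, but not identical to,
// the WhatWG "domain to Unicode" algorithm.
function render_host_for_humans( string $asciiHost ): string {
    $unicode = idn_to_utf8(
        $asciiHost,
        IDNA_NONTRANSITIONAL_TO_UNICODE,
        INTL_IDNA_VARIANT_UTS46,
        $info
    );

    // Fall back to the ASCII serialization if decoding fails or reports errors.
    if ( false === $unicode || ! empty( $info['errors'] ) ) {
        return $asciiHost;
    }

    return $unicode;
}
```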
It may be something fine for a follow-up to this work, but it’s also something
I personally consider essential for any native support of handling URLs that
are destined for human review. If sending to an `href` attribute it should be
the normalized URL; but if displayed as text it should be easy to prevent
tricking people in this way.
In my HTML decoding RFC I tried to bake this decision into the type of the
function using an enum. Since I figured most people are unaware of the role of
the context in which HTML text is decoded, I found the enum to be a suitable
convenience as well as an educational tool.
```
$url->toString( Uri\WhatWg\RenderContext::ForHumans ); // 䕮䕵䕶䕱.com
$url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com
```
The names probably are terrible in all of my code snippets, but at this point
I’m not proposing actual names, just code samples good enough to illustrate the
point. By forcing a choice here (no default value) someone will see the options
and probably make the right call.
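To make that forced choice concrete, here is a minimal sketch of the shape I
have in mind; the names and types are purely illustrative, not a proposal:

```
// Illustrative only: names and shape are not part of the RFC.
namespace Uri\WhatWg;

enum RenderContext
{
    case ForHumans;   // host rendered via "domain to Unicode"
    case ForMachines; // host left in its ASCII/Punycode serialization
}

interface RendersUrls
{
    // No default value: callers must consciously pick a context.
    public function toString( RenderContext $context ): string;
}
```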
----
This is all looking quite nice. I’m happy to see how the RFC continues to
develop, and I’m eagerly looking forward to being able to finally rely on PHP’s
handling of URLs.
Happy new year,
Dennis Snell
Hi Dennis,
> I’m curious to hear from folks here what fraction of the actual PHP
> code deals with RFC3986 URLs, and of those, if the systems using them
> are truly RFC3986 systems or if the common-enough URLs are valid in both
> specs.
Here's my take on both RFCs. RFC3986/87 is a "parsing" RFC which leaves
validation to each individual scheme. For instance, the following URL
is valid under RFC3986 but will be problematic under the WHATWG URL spec:
```
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
```
The LDAP URL is RFC3986 compliant but adds its own validation rules on
top of the RFC. This means that LDAP URL generation would be problematic
if we only implemented the WHATWG spec, which is why having an RFC3986/87 URI
implementation in PHP is crucial.
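As a rough illustration (not part of the proposal), this is approximately how
the RFC3986-style component split of that URL looks with today's parse_url();
everything after the first "?" lands in the query component, double question
mark and all:

```php
// Approximate RFC3986-style component split of the LDAP URL above.
var_dump( parse_url(
    'ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)'
) );
// Expect roughly: scheme "ldap", host "ldap1.example.net", port 6666,
// path "/o=University%20of%20Michigan,c=US",
// query "?sub?(cn=Babs%20Jensen)".
```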
Furthermore, the WHATWG spec not only parses but at the same time
validates and more aggressively normalizes the URL, something RFC3986
does not do; or, more precisely, RFC3986 recognizes and separates
normalizations into two categories, the non-destructive and the destructive ones.
These normalizations affect the scheme, the path and also the host, which
can be very impactful in your application.
```php
For the following URL 'https://0073.0232.0311.0377/b'
RFC3986: 'https://0073.0232.0311.0377/b'
WHATWG URL: 'https://59.154.201.255/b'
```
So this can be a source of confusion for developers. Last but not least,
RFC3986 alone will never be able to parse IDN domain names and requires the
support of RFC3987 IDN domains to do so.
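A small illustration of that gap, assuming ext/intl is available: an IRI host
(RFC3987) has to be mapped to its ASCII form before an RFC3986-only consumer
can work with it.

```php
// Sketch only: assumes ext/intl. An RFC3987 (IRI) host must be converted to
// its ASCII (Punycode) form before an RFC3986-only parser can handle it.
$ascii = idn_to_ascii(
    'bücher.example',
    IDNA_NONTRANSITIONAL_TO_ASCII,
    INTL_IDNA_VARIANT_UTS46
);

var_dump( $ascii ); // roughly: string "xn--bcher-kva.example"
```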
Hopefully with those examples you will understand the strengths and
weaknesses of each spec and why, IMHO, PHP needs both to be up to date.