Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Paul M. Jones Tue, 29 Apr 2025 06:56:49 -0700

Hi Ignace & Maté and all,

tl;dr: I argue against Ignace's objections to splitting the URI class into two 
classes (one that retains raw URI values and another that normalizes values 
as-it-goes). Jump to the very end for a discussion regarding the with() methods 
(search for the word "asymmetry" herein).

* * *

> On Apr 28, 2025, at 15:47, ignace nyamagana butera <nyamsp...@gmail.com> 
> wrote:
> 
> The current approach in userland mixes both raw and half normalized 
> components as well as RFC3986 and RFC3987 specification with ambiguity around 
> normalization, input, constructor, what needs to be encoded where and when

Based on my research into existing URI projects 
<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md> I don't 
think that's an accurate assessment of the ecosystem.

For example, can you point out which projects mix "raw and half-normalized 
components"? Nette is the only one that comes to mind, in that (during parsing) 
it applies rawurldecode() to the host, user, password, and fragment; but that's 
only one of the 18 projects.
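
To make "raw and half-normalized" concrete, here is a minimal sketch of that kind of parse-time decoding. It is not Nette's actual code: the sample URL and the loop are assumptions made for illustration, and only parse_url() and rawurldecode() are real PHP functions.

```php
<?php
declare(strict_types=1);

// Hypothetical sample URL with percent-encoded octets in several components.
$raw = 'https://us%40er:p%3Aw@example.com/a%2Fb?q=x%20y#frag%20ment';

// parse_url() splits the string but leaves percent-encoding untouched.
$parts = parse_url($raw);

// Decode only some components at parse time (the pattern described above).
foreach (['host', 'user', 'pass', 'fragment'] as $key) {
    if (isset($parts[$key])) {
        $parts[$key] = rawurldecode($parts[$key]);
    }
}

echo $parts['user'], "\n";     // us@er       (decoded)
echo $parts['fragment'], "\n"; // frag ment   (decoded)
echo $parts['path'], "\n";     // /a%2Fb      (still encoded)
echo $parts['query'], "\n";    // q=x%20y     (still encoded)
```

The result holds decoded and encoded components side by side, which is the mix in question.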

Likewise, of the 15 URI-centric projects, only one of them (league/uri) offers 
both RFC3986 and 3987 parsing; the two IRI-centric projects (ml/iri and 
rmccue/requests) are explicitly IRIs; and rowbot is clearly WHATWG-URL centric.
So I don't see much ambiguity in any projects there.

As far as normalization, only one project (opis) affords the ability to 
normalize at creation time, though five of them offer a normalize() method with 
various effects 
(<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#normalizing>).
So, again, I don't see much ambiguity there either; they don't do normalizing 
as-you-go, it's something you have to apply explicitly.

Regarding inputs, they all presume "raw" inputs. Regarding constructors, they 
mostly side with a full URI string. Regarding encoding, they mostly retain 
values in their encoded form (there are three outliers, cf. 
<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#component-encoding>).

With all that in mind, we can see that the various authors of userland projects 
have settled on remarkably similar patterns of usage that they found valuable 
and useful for working with URIs.

> > - fulfill existing userland expectations;
> 
> Existing userland expectations are mostly built around `parse_url`

That's kind of true; 9 of the 18 projects use parse_url(), and 7/18 implement 
the RFC 3986 parsing algorithm ...

> which is one of the reasons the RFC exists to improve the status quo and to 
> introduce in PHP valid parsers against recognizable URI specifications. Yes 
> some adaptation will be needed to use them in userland but I believe this 
> work is easy to do, talking from the POV of a URI package maintainer.

... but I don't imagine that replacing parse_url() in those projects with the 
RFC 3986 algo would cause those projects to change any of their other design 
decisions. What adaptations do you think would be needed around that 
replacement?

> > - replace the toString()/toRawString() with a single idiomatic __toString() 
> > in each class;
> 
> For all the reasons explained in the RFC, adding a `__toString` method is a 
> bad architectural design for an URI. There are so many ways to represent an 
> URI that  having a `__toString` for string representation gives a false sense 
> of "there can be only one true representation for a single URI" which is not 
> true.

For Rfc3986\Uri, it looks like there are only two that are recognized: raw and 
normalized. Are there other string representations you feel the Uri class 
should recognize?

(For Whatwg\Url, it looks like there are also only two: as-parsed, and as 
ASCII, but I'm not addressing that part of the RFC here.)

> > - move normalization logic into the NormalizedUri class.
> 
> The classes follow  specifications that describe how normalization should be. 
> Why would you split the responsibilities in other classes ? What would be the 
> added value ? 

For one, unless I am missing something, there is an asymmetry between the get() 
methods and the with() methods. What I'm seeing is that (e.g.) Uri::withPath() 
expects a raw path argument, but getPath() returns the normalized version.  For 
symmetry, I would expect either:

- `Uri::withPath(raw_value) : self` and `Uri::getPath() : raw_value`, or
- `Uri::withRawPath(raw_value) : self` and `Uri::getRawPath() : raw_value`

Thus my first intuition is that the "main" values in the URI need to be the raw 
ones, and that getting the normalized ones should be the more verbose case 
(e.g. `getNormalizedPath() : normalized_value`).
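
The asymmetry can be sketched with a toy, path-only class. This is not the RFC's implementation: `AsymmetricUri` and `normalizePath()` are hypothetical names, and the normalization covers only a small subset of RFC 3986 section 6.2.2 (uppercasing percent-encoding hex digits and decoding unreserved characters).

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: a small subset of RFC 3986 normalization. */
function normalizePath(string $path): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            $char = chr((int) hexdec($m[1]));
            // Unreserved characters (ALPHA / DIGIT / "-" / "." / "_" / "~")
            // are decoded; anything else stays encoded with uppercase hex.
            return preg_match('/\A[A-Za-z0-9._~-]\z/', $char) === 1
                ? $char
                : '%' . strtoupper($m[1]);
        },
        $path
    );
}

/** Toy class mirroring the described behavior: raw in, normalized out. */
final class AsymmetricUri
{
    public function __construct(private string $rawPath = '/') {}

    public function withPath(string $rawPath): self
    {
        return new self($rawPath);
    }

    public function getPath(): string
    {
        return normalizePath($this->rawPath);
    }

    public function getRawPath(): string
    {
        return $this->rawPath;
    }
}

$input = '/%7euser/a%2fb';
$uri = (new AsymmetricUri())->withPath($input);

echo $uri->getPath(), "\n";    // /~user/a%2Fb   (not what was passed in)
echo $uri->getRawPath(), "\n"; // /%7euser/a%2fb (what was passed in)
```

In this sketch, the value passed to withPath() comes back unchanged only from getRawPath(), not from its same-named counterpart getPath().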

So, one value added by splitting the classes is to resolve that asymmetry. 
Consumers expecting to get back from the URI what they put into it can use the 
raw Uri variation; "API clients or signers fall in this category that want to 
avoid introducing any unnecessary changes to URIs, in order to avoid causing 
subtle bugs." 

Other consumers, who want to do things this new and different way (normalized 
as-you-go, unlike anything currently in userland) can use the NormalizedUri.

(Or you could flip it around and say that the normalized variation is the Uri 
class, and the raw version is RawUri.)
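
A rough, path-only sketch of what the split could look like. The class names `RawUri` and `NormalizedUri` are hypothetical, and the normalization here is deliberately reduced to uppercasing percent-encoding hex digits, so treat it as an illustration of the shape rather than a proposed implementation.

```php
<?php
declare(strict_types=1);

/** Hypothetical raw variation: returns exactly what was put in. */
final class RawUri
{
    public function __construct(private string $path = '/') {}

    public function withPath(string $path): self
    {
        return new self($path);
    }

    public function getPath(): string
    {
        return $this->path;
    }
}

/** Hypothetical normalizing variation: normalizes on the way in. */
final class NormalizedUri
{
    private string $path;

    public function __construct(string $path = '/')
    {
        $this->path = self::normalize($path);
    }

    public function withPath(string $path): self
    {
        return new self($path);
    }

    public function getPath(): string
    {
        return $this->path;
    }

    // Assumption: only uppercases the hex digits of percent-encoded octets.
    private static function normalize(string $path): string
    {
        return preg_replace_callback(
            '/%[0-9A-Fa-f]{2}/',
            static fn (array $m): string => strtoupper($m[0]),
            $path
        );
    }
}

$input = '/a%2fb';
$raw = (new RawUri())->withPath($input);
$normalized = (new NormalizedUri())->withPath($input);

echo $raw->getPath(), "\n";        // /a%2fb
echo $normalized->getPath(), "\n"; // /a%2Fb
```

Within each class the with/get pair is symmetric: each getter returns the form its own class stores.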

-- pmj
