Hi Ignace & Maté and all,

tl;dr: I argue against Ignace's objections to splitting the URI class into two classes (one that retains raw URI values and another that normalizes values as-it-goes). Jump to the very end for a discussion regarding the with() methods (search for the word "asymmetry" herein).
* * *

> On Apr 28, 2025, at 15:47, ignace nyamagana butera <nyamsp...@gmail.com>
> wrote:
>
> The current approach in userland mixes both raw and half normalized
> components as well as RFC3986 and RFC3987 specification with ambiguity around
> normalization, input, constructor, what needs to be encoded where and when

Based on my research into existing URI projects <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md> I don't think that's an accurate assessment of the ecosystem.

For example, can you point out which projects mix "raw and half-normalized components"? Nette is the only one that comes to mind, in that (during parsing) it applies rawurldecode() to the host, user, password, and fragment; but that's only one of the 18 projects.

Likewise, of the 15 URI-centric projects, only one of them (league/uri) offers both RFC3986 and 3987 parsing; the two IRI-centric projects (ml/iri and rmccue/requests) are explicitly IRIs; and rowbot is clearly WHATWG-URL centric. So I don't see much ambiguity in any projects there.

As far as normalization goes, only one project (opis) affords the ability to normalize at creation time, though five of them offer a normalize() method with various effects (<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#normalizing>). So, again, I don't see much ambiguity there either; they don't do normalizing as-you-go, it's something you have to apply explicitly.

Regarding inputs, they all presume "raw" inputs.

Regarding constructors, they mostly side with a full URI string.

Regarding encoding, they mostly retain values in their encoded form (there are three outliers, cf. <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#component-encoding>).

With all that in mind, we can see that the various authors of userland projects have settled on remarkably similar patterns of usage that they found valuable and useful for working with URIs.
> > - fulfill existing userland expectations;
>
> Existing userland expectations are mostly built around `parse_url`

That's kind of true; 9 of the 18 projects use parse_url(), and 7/18 implement the RFC 3986 parsing algorithm ...

> which is one of the reasons the RFC exists to improve the status quo and to
> introduce in PHP valid parsers against recognizable URI specifications. Yes
> some adaptation will be needed to use them in userland but I believe this
> work is easy to do, talking from the POV of a URI package maintainer.

... but I don't imagine that replacing parse_url() in those projects with the RFC 3986 algo would cause those projects to change any of their other design decisions. What adaptations do you think would be needed around that replacement?

> > - replace the toString()/toRawString() with a single idiomatic __toString()
> > in each class;
>
> For all the reasons explained in the RFC, adding a `__toString` method is a
> bad architectural design for an URI. There are so many ways to represent an
> URI that having a `__toString` for string representation gives a false sense
> of "there can be only one true representation for a single URI" which is not
> true.

For Rfc3986\Uri, it looks like there are only two that are recognized: raw and normalized. Are there other string representations you feel the Uri class should recognize?

(For Whatwg\Url, it looks like there are also only two: as-parsed, and as ASCII, but I'm not addressing that part of the RFC here.)

> > - move normalization logic into the NormalizedUri class.
>
> The classes follow specifications that describe how normalization should be.
> Why would you split the responsibilities in other classes ? What would be the
> added value ?

For one, unless I am missing something, there is an asymmetry between the get() methods and the with() methods. What I'm seeing is that (e.g.) Uri::withPath() expects a raw path argument, but getPath() returns the normalized version.
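
To make that concrete, here is a rough sketch of the round trip I have in mind. It does not use the RFC's classes: ToyUri is a made-up stand-in that models only a path, and its "normalization" is assumed to be nothing more than uppercasing percent-encoded hex digits (one of the RFC 3986 normalization steps), so take it as an illustration of the shape of the problem and not of the actual implementation.

```php
<?php
declare(strict_types=1);

// Hypothetical toy class, NOT the RFC's Uri class. It models a path only, and
// "normalization" is assumed to be just uppercasing percent-encoded hex digits.
final class ToyUri
{
    public function __construct(private string $rawPath = '')
    {
    }

    // Accepts a raw path, the way Uri::withPath() appears to.
    public function withPath(string $rawPath): self
    {
        return new self($rawPath);
    }

    // Returns the normalized path, so callers may not get back what they put in.
    public function getPath(): string
    {
        return preg_replace_callback(
            '/%[0-9a-f]{2}/i',
            static fn (array $m): string => strtoupper($m[0]),
            $this->rawPath
        ) ?? $this->rawPath;
    }

    // Returns exactly the value that was passed to withPath().
    public function getRawPath(): string
    {
        return $this->rawPath;
    }
}

$input = '/foo%2fbar';
$uri = (new ToyUri())->withPath($input);

var_dump($uri->getPath() === $input);    // false: getPath() gives '/foo%2Fbar'
var_dump($uri->getRawPath() === $input); // true
```

The point is only that the value handed to withPath() and the value returned by getPath() can differ, while the "raw" getter is the one that round-trips.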
For symmetry, I would expect either:

- `Uri::withPath(raw_value) : self` and `Uri::getPath() : raw_value`, or

- `Uri::withRawPath(raw_value) : self` and `Uri::getRawPath() : raw_value`

Thus my first intuition that the "main" values in the URI need to be the raw ones, and that getting the normalized ones should be the more verbose case (e.g. `getNormalizedPath() : normalized_value`).

So, one value added by splitting the classes is to resolve that asymmetry. Consumers expecting to get back from the URI what they put into it can use the raw Uri variation; "API clients or signers fall in this category that want to avoid introducing any unnecessary changes to URIs, in order to avoid causing subtle bugs." Other consumers, who want to do things this new and different way (normalized as-you-go, unlike anything currently in userland) can use the NormalizedUri.

(Or you could flip it around and say that the normalized variation is the Uri class, and the raw version is RawUri.)

-- pmj