Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

ignace nyamagana butera Tue, 29 Apr 2025 13:10:20 -0700

Hi Paul,

I will try to address your concerns. Keep in mind that I am not the author
of the RFC, but I do like how it is currently shaped. I have some caveats,
but those can be handled as future improvements.

> So, one value added by splitting the classes is to resolve that asymmetry.

First, I agree with you. The method naming in the Uri\Rfc3986\Uri class
could be improved, even though it is not a showstopper to me. Adding the
`raw` prefix, or indeed flipping the raw* methods and using normalized*
instead, would perhaps clarify things, but I will leave that decision to
Máté.
Apart from that, I believe the current RFC (especially around RFC3986)
addresses most if not all of the issues regarding the specification.
RFC3986 covers 3 key URI features: parsing, resolution and equivalence. In
order to offer resolution and equivalence you have to address normalization,
and thus encoding. Any userland package that offers those features has to
handle component encoding/normalization first, before performing the
expected operation. That is why I believe that if the new URI class offers
equivalence, it can and should also expose URI component normalization out
of the box. A separate class is IMHO not needed.
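
To illustrate why equivalence depends on normalization, here is a minimal
sketch. `sketchNormalize()` is a hypothetical helper of mine, not part of the
RFC, and it only applies two of the RFC 3986 normalizations (case of scheme
and host, case of percent-encoded triplets). It assumes the URI has an
authority without userinfo.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, for illustration only.
 * Lowercases the scheme and host, then uppercases percent-encoded triplets.
 * Assumption: no userinfo in the authority; dot segments are left untouched.
 */
function sketchNormalize(string $uri): string
{
    // Lowercase "scheme://authority" (both are case-insensitive).
    $uri = preg_replace_callback(
        '~^[a-z][a-z0-9+.-]*://[^/?#]*~i',
        static fn (array $m): string => strtolower($m[0]),
        $uri
    ) ?? $uri;

    // Uppercase the hex digits of every percent-encoded triplet.
    return preg_replace_callback(
        '/%[0-9a-f]{2}/i',
        static fn (array $m): string => strtoupper($m[0]),
        $uri
    ) ?? $uri;
}

$a = 'HTTPS://Example.COM/a%2fb';
$b = 'https://example.com/a%2Fb';

var_dump($a === $b);                                   // bool(false)
var_dump(sketchNormalize($a) === sketchNormalize($b)); // bool(true)
```

The two raw strings differ, yet they identify the same resource; only the
normalized forms can be compared safely.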

> For example, can you point out which projects mix "raw and
half-normalized components"?

Laminas, for example, or any class implementing PSR-7 will try to encode the
input string regardless of its current encoding; hence the wording about not
double encoding the string that you often encounter in mutator method
docblocks. The Uri class, on the other hand, only expects well-formed and
already encoded strings, which leaves no room for misinterpretation. This is
one area that is left for URI packages to fill.

> For Rfc3986\Uri, it looks like there are only two that are recognized:
raw and normalized. Are there other string representations you feel the Uri
class should recognize?

If there are at least two possible representations, then a `__toString`
method is still a bad design, because it may lead the developer to think
that there is only one string representation, which is not true. Both
representations are equivalent and represent the URI equally well. As a
bonus, not having a `__toString` method prevents accidental URI comparison
using the `==` operator instead of the correct `equals` method. (I know
this because I've seen codebases where PSR-7 URI instances are compared
through the class's `__toString` method, which is just wrong.)
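
A small sketch of the pitfall I have in mind. `StringableUri` is a
hypothetical class written for this email; it is neither the RFC's Uri class
nor a PSR-7 implementation, and its `equals()` only ignores case in the
scheme and host.

```php
<?php
declare(strict_types=1);

// Hypothetical class, for illustration only.
final class StringableUri
{
    public function __construct(private string $raw) {}

    // Returns the URI exactly as given: one representation among several.
    public function __toString(): string
    {
        return $this->raw;
    }

    // Assumption: equivalence here only ignores case in scheme and host.
    public function equals(self $other): bool
    {
        return self::lowerSchemeAndHost($this->raw)
            === self::lowerSchemeAndHost($other->raw);
    }

    private static function lowerSchemeAndHost(string $uri): string
    {
        return preg_replace_callback(
            '~^[a-z][a-z0-9+.-]*://[^/?#]*~i',
            static fn (array $m): string => strtolower($m[0]),
            $uri
        ) ?? $uri;
    }
}

$x = new StringableUri('HTTP://Example.com/Path');
$y = new StringableUri('http://example.com/Path');

var_dump((string) $x == (string) $y); // bool(false)
var_dump($x->equals($y));             // bool(true)
```

The string cast silently picks one representation, so the comparison fails
even though both objects describe the same URI.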

PS1: I do appreciate the work you put into your study of URI packages in
the PHP ecosystem, but we should not restrict the new API to only match or
align with those existing solutions. Instead, we should try to expose an API
that allows more flexibility than what PHP currently offers.
PS2: I do not think the new API will replace the URI packages. We will
still need them because, in the case of the RFC3986 URI class, parsing is
just one aspect of URI consumption; we still need scheme-specific validation
that only PHP userland packages can offer.

Best regards,
Ignace Nyamagana Butera

On Tue, Apr 29, 2025 at 3:55 PM Paul M. Jones <pmjo...@pmjones.io> wrote:

> Hi Ignace & Máté and all,
>
> tl;dr: I argue against Ignace's objections to splitting the URI class into
> two classes (one that retains raw URI values and another that normalizes
> values as-it-goes). Jump to the very end for a discussion regarding the
> with() methods (search for the word "asymmetry" herein).
>
> * * *
>
> > On Apr 28, 2025, at 15:47, ignace nyamagana butera <nyamsp...@gmail.com>
> wrote:
> >
> > The current approach in userland mixes both raw and half normalized
> components as well as RFC3986 and RFC3987 specification with ambiguity
> around normalization, input, constructor, what needs to be encoded where
> and when
>
> Based on my research into existing URI projects <
> https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md> I
> don't think that's an accurate assessment of the ecosystem.
>
> For example, can you point out which projects mix "raw and half-normalized
> components"? Nette is the only one that comes to mind, in that (during
> parsing) it applies rawurldecode() to the host, user, password, and
> fragment; but that's only one of the 18 projects.
>
> Likewise, of the 15 URI-centric projects, only one of them (league/uri)
> offers both RFC3986 and 3987 parsing; the two IRI-centric projects (ml/iri
> and rmccue/requests) are explicitly IRIs; and rowbot is clearly WHATWG-URL
> centric.  So I don't see much ambiguity in any projects there.
>
> As far as normalization, only one project (opis) affords the ability to
> normalize at creation time, though five of them offer a normalize() method
> with various effects (<
> https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#normalizing>).
> So, again, I don't see much ambiguity there either; they don't do
> normalizing as-you-go, it's something you have to apply explicitly.
>
> Regarding inputs, they all presume "raw" inputs. Regarding constructors,
> they mostly side with a full URI string. Regarding encoding, they mostly
> retain values in their encoded form (there are three outliers, cf. <
> https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#component-encoding
> >).
>
> With all that in mind, we can see that the various authors of userland
> projects have settled on remarkably similar patterns of usage that they
> found valuable and useful for working with URIs.
>
>
> > > - fulfill existing userland expectations;
> >
> > Existing userland expectations are mostly built around `parse_url`
>
> That's kind of true; 9 of the 18 projects use parse_url(), and 7/18
> implement the RFC 3986 parsing algorithm ...
>
>
> > which is one of the reasons the RFC exists to improve the status quo and
> to introduce in PHP valid parsers against recognizable URI specifications.
> Yes some adaptation will be needed to use them in userland but I believe
> this work is easy to do, talking from the POV of a URI package maintainer.
>
> ... but I don't imagine that replacing parse_url() in those projects with
> the RFC 3986 algo would cause those projects to change any of their other
> design decisions. What adaptations do you think would be needed around that
> replacement?
>
>
> > > - replace the toString()/toRawString() with a single idiomatic
> __toString() in each class;
> >
> > For all the reasons explained in the RFC, adding a `__toString` method
> is a bad architectural design for an URI. There are so many ways to
> represent an URI that  having a `__toString` for string representation
> gives a false sense of "there can be only one true representation for a
> single URI" which is not true.
>
> For Rfc3986\Uri, it looks like there are only two that are recognized: raw
> and normalized. Are there other string representations you feel the Uri
> class should recognize?
>
> (For Whatwg\Url, it looks like there are also only two: as-parsed, and as
> ASCII, but I'm not addressing that part of the RFC here.)
>
>
> > > - move normalization logic into the NormalizedUri class.
> >
> > The classes follow  specifications that describe how normalization
> should be. Why would you split the responsibilities in other classes ? What
> would be the added value ?
>
> For one, unless I am missing something, there is an asymmetry between the
> get() methods and the with() methods. What I'm seeing is that (e.g.)
> Uri::withPath() expects a raw path argument, but getPath() returns the
> normalized version.  For symmetry, I would expect either:
>
> - `Uri::withPath(raw_value) : self` and `Uri::getPath() : raw_value`, or
> - `Uri::withRawPath(raw_value) : self` and `Uri::getRawPath() : raw_value`
>
> Thus my first intuition that the "main" values in the URI need to be the
> raw ones, and that getting the normalized ones should be the more verbose
> case (e.g. `getNormalizedPath() : normalized_value`).
>
> So, one value added by splitting the classes is to resolve that asymmetry.
> Consumers expecting to get back from the URI what they put into it can use
> the raw Uri variation; "API clients or signers fall in this category that
> want to avoid introducing any unnecessary changes to URIs, in order to
> avoid causing subtle bugs."
>
> Other consumers, who want to do things this new and different way
> (normalized as-you-go, unlike anything currently in userland) can use the
> NormalizedUri.
>
> (Or you could flip it around and say that the normalized variation is the
> Uri class, and the raw version is RawUri.)
>
>
>
> -- pmj
>
>
