Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Máté Kocsis Mon, 26 Aug 2024 00:42:10 -0700

Hi Ignace, Niels,

Sorry for being silent for so long; I was working hard on the
implementation alongside some summer activities :) I made really good
progress in the last month, and I think (hope) that I have now
addressed most of the concerns/suggestions people raised
in this thread. To summarize the most important changes:


- The uriparser library is now used for parsing URIs based on RFC 3986.
- I renamed the extension from "url" to "uri" in order to
make the name more generic and to express the new use-case.
- There is no Url\UrlParser class anymore. The Uri\Uri class now includes
the relevant factory methods.
- Uri\Uri is now an abstract class which is extended by 2 concrete
classes: Uri\Rfc3986Uri and Uri\WhatwgUri.
- WHATWG URL parsing now returns the exact error code according to the
specification (a reference parameter is used for this for now, but this is
TBD).
- As suggested by Niels, it's now possible to plug a URI parsing
implementation into PHP. A new uri.default_handler INI option is also added.
Currently, integration is only implemented for FILTER_VALIDATE_URL though.
The approach also makes it possible to register additional 3rd party
libraries for parsing URIs (like Ada URL).
- It looks like performance improved significantly according to the
rough benchmarks performed in CI.
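
Purely as an illustration of the pluggable-handler idea from the list above,
here is a minimal sketch in Python. It is not the extension's C
implementation: register_handler, parse, the "stdlib" handler name, and the
default_handler variable (standing in for the uri.default_handler INI
option) are all made up for this example.

```python
from typing import Callable, Dict, Optional
from urllib.parse import SplitResult, urlsplit

# Hypothetical registry: handler name -> parsing function.
_handlers: Dict[str, Callable[[str], SplitResult]] = {}

# Stand-in for the uri.default_handler INI setting (the value is made up).
default_handler = "stdlib"


def register_handler(name: str, parser: Callable[[str], SplitResult]) -> None:
    """Make a parsing backend available under the given name."""
    _handlers[name] = parser


def parse(uri: str, handler: Optional[str] = None) -> SplitResult:
    """Parse with the named backend, or with the default one if none is given."""
    name = handler if handler is not None else default_handler
    if name not in _handlers:
        raise LookupError(f"no URI handler registered as {name!r}")
    return _handlers[name](uri)


# Python's own parser stands in for a real backend such as uriparser.
register_handler("stdlib", urlsplit)
```

Callers that don't care about the backend simply call parse(uri); a third
party library would only need to call register_handler once to become
selectable.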

Please re-read the RFC, as it goes into a bit more detail than my quick
summary above: https://wiki.php.net/rfc/url_parsing_api

There are some questions I still haven't managed to find an answer for,
though. Most importantly, the URI parser libraries used don't support
modification of the URI. That's why I had to get rid of the "wither"
methods for now, which were originally part of the API. I think it's
unfortunate, and I'll try to do my best to bring them back.
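
For readers unfamiliar with the term: a "wither" returns a modified copy
and leaves the original object untouched. A minimal sketch of the pattern,
in Python for brevity; the Uri class and with_host method below are
hypothetical and are not the API proposed in the RFC.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Uri:
    """Hypothetical immutable URI value object (illustration only)."""
    scheme: str
    host: str
    path: str = "/"

    def with_host(self, host: str) -> "Uri":
        # Build a new instance; the frozen original cannot be mutated.
        return replace(self, host=host)

    def __str__(self) -> str:
        return f"{self.scheme}://{self.host}{self.path}"


original = Uri("https", "example.com", "/docs")
changed = original.with_host("php.net")
```

A parser library that only exposes read access to the parsed components
gives nothing to build the modified copy from, short of recomposing the
string and parsing it again.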

Additionally, for technical reasons, extending the Uri\Uri class in
userland is only possible if all the methods are overridden by the child.
That's because I had to use "computed" properties in the implementation
(roughly, they are stored in an internal C struct, unlike regular
properties). That's why it may be better if userland code could use (and
possibly implement) a Uri\Uri interface instead.

In one of my previous emails, I raised some concerns about whether RFC 3986
and the WHATWG spec can really share the same interface (they do in my
current implementation, even though they are different classes). I still
have this concern because WHATWG specifies the "user" and "password" URL
components, while RFC 3986 only specifies the notion of "userinfo" (which
is usually just user:password, but not necessarily, as far as I
understand). The RFC's implementation of the RFC 3986 parser currently
splits the "userinfo" component at the ":" character, but doing so doesn't
seem very spec compliant.
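
To make the mismatch concrete, here is a small sketch using Python's
standard urllib.parse (chosen only because it is easy to run; it is not
either of the parsers discussed here). RFC 3986 allows ":" anywhere inside
userinfo, so splitting at the first ":" is an interpretation layered on top
of the spec. A real WHATWG parser would also apply percent-encoding, which
this sketch ignores.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://user:pa:ss@example.com/")

# RFC 3986 view: userinfo is a single opaque component,
# i.e. everything in the authority before the last "@".
userinfo = parts.netloc.rpartition("@")[0]

# user/password view: split the same string at the first ":".
user, _, password = userinfo.partition(":")
```

Here userinfo is "user:pa:ss", while the split yields "user" and "pa:ss".
With a userinfo that has no ":" at all, the split can no longer distinguish
"no password" from "empty password", which the single-component view never
has to decide.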

Arnaud suggested that it would be better if the query parameters could be
retrieved both escaped and unescaped after parsing. I haven't had time to
investigate the possibilities, but my gut feeling is that it's only
possible to achieve with some custom code. Arnaud also had questions
regarding canonization. Currently, it's not performed when calling the
__toString() method, because only the uriparser library supports this
feature, and I didn't want the two implementations to diverge. I'm not
even sure that it's a good idea to always do it, so I'm thinking about
making this feature selectively available (e.g. by adding a separate
"toCanonizedString" method).
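
As a rough illustration of what "escaped and unescaped" retrieval of query
parameters means, here is a sketch with Python's standard urllib.parse. The
sample URL is made up, and the hand-rolled split is exactly the kind of
custom code mentioned above; it assumes every pair contains a "=".

```python
from urllib.parse import parse_qsl, urlsplit

query = urlsplit("https://example.com/search?q=a%20b&tag=c%2Fd").query

# Escaped form: split on "&" and "=" only, leaving percent-encoding intact.
escaped = [tuple(pair.split("=", 1)) for pair in query.split("&")]

# Unescaped form: the standard library decodes the percent-escapes.
unescaped = parse_qsl(query)
```

The escaped list keeps "a%20b" and "c%2Fd" as written, while the unescaped
list holds "a b" and "c/d". Only the escaped form can be written back into
a URI without re-encoding decisions.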

Regards,
Máté
