Hi Ignace, Niels,

Sorry for being silent for so long. I was working hard on the implementation, besides some summer activities :) I made really good progress in the last month, and I think (hope) I have managed to address most of the concerns and suggestions raised in this thread. To summarize the most important changes:

- The uriparser library is now used for parsing URIs based on RFC 3986.
- I renamed the extension from "url" to "uri" in order to make the name more generic and to express the new use-case.
- There is no Url\UrlParser class anymore. The Uri\Uri class now includes the relevant factory methods.
- Uri\Uri is now an abstract class which is implemented by 2 concrete classes: Uri\Rfc3986Uri and Uri\WhatwgUri.
- WHATWG URL parsing now returns the exact error code according to the specification (although a reference parameter is used for now, this is TBD).
- As suggested by Niels, it's now possible to plug a URI parsing implementation into PHP. A new uri.default_handler INI option is also added. Currently, integration is only implemented for FILTER_VALIDATE_URL though. The approach also makes it possible to register additional 3rd party libraries for parsing URIs (like ADA URL).
- Performance appears to have improved significantly, according to the rough benchmarks performed in CI.

Please re-read the RFC, as it shares a bit more detail than my quick summary above: https://wiki.php.net/rfc/url_parsing_api

There are still some questions I haven't managed to find an answer for. Most importantly, the URI parser libraries used don't support modification of the URI. That's why I had to get rid of the "wither" methods for now, which were originally part of the API. I think it's unfortunate, and I'll do my best to bring them back.

Additionally, due to technical reasons, extending the Uri\Uri class in userland is only possible if all the methods are overridden by the child. That's because I had to use "computed" properties in the implementation (roughly, they are stored in an internal C struct, unlike regular properties). That's why it may be better if userland code could use (and possibly implement) a Uri\Uri interface instead.

In one of my previous emails, I had some concerns about whether RFC 3986 and the WHATWG spec can really share the same interface (they do in my current implementation, even though they are different classes). I still have this concern, because WHATWG specifies the "user" and "password" URL components, while RFC 3986 only specifies the notion of "userinfo" (which is usually just user:password, but that's not necessarily the case as far as I understand). The RFC's implementation of the RFC 3986 parser currently splits the "userinfo" component at the ":" character, but doing so doesn't seem very spec compliant.

Arnaud suggested that it would be better if the query parameters could be retrieved both escaped and unescaped after parsing. I haven't had time to investigate the possibilities, but my gut feeling is that it's only possible to achieve with some custom code.

Arnaud also had questions regarding canonicalization. Currently, it's not performed when calling the __toString() method, because only the uriparser library supports this feature, and I didn't want the two implementations to diverge. I'm not even sure that it's a good idea to always do it, so I'm thinking about making this feature selectively available (i.e. adding a separate "toCanonizedString" method).

Regards,
Máté
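
P.S. To make the "userinfo" concern a bit more concrete, here is a minimal sketch in Python. It is not the extension's code and doesn't use uriparser; the split_userinfo() helper is a hypothetical name, and Python's urllib.parse only stands in for a generic parser. It shows what splitting at the first ":" does when the userinfo contains no colon at all, or more than one:

```python
from urllib.parse import urlsplit


def split_userinfo(uri):
    """Return (userinfo, user, password) for a URI; each may be None.

    Hypothetical helper for illustration only: it mimics the
    "split userinfo at the first ':'" behaviour discussed above.
    """
    netloc = urlsplit(uri).netloc
    if "@" not in netloc:
        # No userinfo component at all.
        return None, None, None
    # Everything before the last "@" in the authority is the userinfo.
    userinfo = netloc.rpartition("@")[0]
    # Split at the FIRST colon; RFC 3986 itself gives no meaning to the parts.
    user, sep, password = userinfo.partition(":")
    return userinfo, user, (password if sep else None)


if __name__ == "__main__":
    for uri in (
        "https://user:pass@example.com/",
        "https://token@example.com/",
        "https://user:pa:ss@example.com/",
    ):
        print(uri, "->", split_userinfo(uri))
```

RFC 3986 allows ":" anywhere inside userinfo and deprecates the "user:password" form, so the user/password pair above is only a convention layered on top of the spec, which is the mismatch with the WHATWG "username" and "password" components.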