Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

ignace nyamagana butera Sat, 28 Dec 2024 12:34:34 -0800

On 28/12/2024 14:42, Máté Kocsis wrote:

Hi Ignace,
Thank you for your efforts!

    Specifically for RFC3986Uri I see that the only difference between
    the `parse` named constructor and the constructor is that the former
    will return `null` instead of throwing an exception. But it is not
    clear if both methods can work with partial URI. What is the
    expected result of

    new Rfc3986Uri('?query#fragment');

As you supposed, Uri\Rfc3986Uri can parse such a relative URI no matter which method is used, while Uri\WhatWgUri will throw an exception/return null. That's why I'm still evaluating the possibility of calling the latter class "URL" in order to make it clear that the scheme is required.

The naming question initially came up during an internal PHP Foundation discussion where Tim proposed that the auxiliary WHATWG related classes (WhatWgError, WhatWgErrorType) should be put into a separate Uri\WhatWg sub-namespace. However, it was not clear to me whether it's a good idea to also put the main URI representations into their respective sub-namespaces (so that we would have Uri\Rfc3986\Uri and Uri\WhatWg\Uri), because this way one would have to use an alias in order to use both classes in the same file. Nor do I like the idea of using Uri\Rfc3986\Rfc3986Uri and Uri\WhatWg\WhatWgUri, because it's completely inconsistent with the latest practices. That's why I'm now leaning towards using Uri\Rfc3986\Uri and Uri\WhatWg\Url: this way, there's a very clear distinction about the expected URI format, while the classes can be put into separate namespaces without a class name clash. Additionally, class names would become shorter, easier to write and comprehend.

    I also think that the RFC should emphasize that the RFC3986 URI is
    only **parsing** the URI and not validating the URI like the
    WHATWGUri counterpart. The following URI will pass without issue:

    new Rfc3986Uri('https:example.com');

    This is a valid RFC3986 URI but it is clearly not a valid http URL.

Hm, thanks again for finding this gotcha. Yes, this is also a difference between the two specifications: RFC3986 will resolve example.com as a path (since only a "//" after the scheme would indicate that example.com is part of the authority component), while WHATWG will automatically resolve the input URI as "https://example.com/", making it a valid HTTP URL in fact. Fortunately, the behavior of both classes is in line with their respective specifications. In the case of RFC 3986, the spec says:

A parser of the generic URI syntax can parse any URI reference into
its major components.  Once the scheme is determined, further
scheme-specific parsing can be performed on the components.  In other
words, the URI generic syntax is a superset of the syntax of all URI
schemes.

So the underlying parser doesn't do the scheme-specific processing, which is understandable. IMO that's why it's useful to allow the extension of URI classes so that the child implementations can do further processing at will. Alternatively, I could imagine adding support for scheme-specific processors: i.e. an array of Uri\SchemeProcessor interface instances could be passed to URIs, and the methods of the relevant class based on the URI's scheme would be executed when necessary (during parsing, normalization, etc). This is a possible rabbit hole again, so I don't want to include this in the current proposal, but I think it's an interesting possibility.
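
To illustrate the path-versus-authority point, here is a minimal sketch that splits a URI reference with the regular expression given in RFC 3986, Appendix B. The helper name `splitUriReference()` is hypothetical and is not part of the proposed API; it only shows how a generic parser sees `https:example.com`, and it does no validation.

```php
// Hypothetical helper for illustration only; not part of the proposed Uri classes.
// The pattern is the generic component regex from RFC 3986, Appendix B.
function splitUriReference(string $uri): array
{
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    preg_match($pattern, $uri, $m, PREG_UNMATCHED_AS_NULL);

    return [
        'scheme'    => $m[2] ?? null, // text before the first ":"
        'authority' => $m[4] ?? null, // only present when "//" follows the scheme
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? null,
        'fragment'  => $m[9] ?? null,
    ];
}

$parts = splitUriReference('https:example.com');
var_dump($parts['authority']); // NULL: there is no "//", so there is no authority
var_dump($parts['path']);      // string(11) "example.com"
```

With `https://example.com` as input, the same helper reports `example.com` as the authority and an empty path, which is the reading an HTTP client would expect.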

Another topic I wanted to bring up is encoding and decoding of URI components. This problem was found by Arnaud during an offline discussion. Let me quote my interpretation of his words that I added to the RFC a few days ago (https://wiki.php.net/rfc/url_parsing_api#how_special_characters_are_handled):

    Encoding and decoding special characters is a crucial aspect of
    URI parsing. For this purpose, both RFC 3986 and WHATWG use percent-
    encoding <https://en.wikipedia.org/wiki/Percent-encoding> (i.e. the
    |%| character is encoded as |%25|). However, the two standards
    differ significantly in this regard:

    RFC 3986 defines that “URIs that differ in the replacement of an
    unreserved character with its corresponding percent-encoded US-
    ASCII octet are equivalent”, which means that percent-encoded
    characters and their decoded form are equivalent. On the contrary,
    WHATWG defines URL equivalence by the equality of the serialized
    URLs, and never decodes percent-encoded characters, except in the
    host. This implies that percent-encoded characters are not
    equivalent to their decoded form (except in the host).

    The difference between RFC 3986 and WHATWG comes from the fact that
    the point of view of a maintainer of the WHATWG specification is
    that webservers may legitimately choose to consider encoded and
    decoded paths distinct, and a standard cannot force them not to do
    so <https://github.com/whatwg/url/issues/606#issuecomment-926395864>.
    This is a substantial BC break compared to RFC 3986, and it is
    actually a source of confusion among users of the WHATWG
    specification based on the large number of tickets related to this
    question.
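
As a rough illustration of the two notions of equivalence, the sketch below decodes only percent-encoded unreserved characters and uppercases the remaining hex digits, in the spirit of RFC 3986 section 6.2.2, and then contrasts that with a plain string comparison. The function name `normalizePercentEncoding()` is made up for this example, and the plain `===` comparison is only a simplified stand-in for comparing serialized WHATWG URLs.

```php
// Hypothetical helper: decodes percent-encoded *unreserved* characters only
// and uppercases the hex digits of every other percent-encoded octet.
function normalizePercentEncoding(string $component): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            $char = chr(hexdec($m[1]));
            // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
            if (preg_match('/^[A-Za-z0-9\-._~]$/', $char) === 1) {
                return $char;
            }
            return '%' . strtoupper($m[1]);
        },
        $component
    );
}

$a = '/%7Euser/a%2fb';
$b = '/~user/a%2Fb';

// RFC 3986 style: equivalent after normalization.
var_dump(normalizePercentEncoding($a) === normalizePercentEncoding($b)); // bool(true)

// Byte-for-byte comparison of the two strings: not equal.
var_dump($a === $b); // bool(false)
```

Note that `%2F` stays encoded in both cases, because `/` is a reserved character and decoding it would change the path structure.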

Currently, we are brainstorming how to best resolve this problem. It is very important to specify exactly what kind of representation people should expect when they invoke a getter, so Arnaud suggested that we should have a fine-grained API by adding a $mode enum parameter to the getters with the following possible values:

    ComponentMode::Raw: return the raw value, exactly as the component
    is represented in the URL (as if we just returned a substr() of the url)
    ComponentMode::PercentDecoded: Raw, but every percent-encoded
    character is decoded
    ComponentMode::WhatWGNormalized and RFC3986Normalized: The value
    normalized exactly as specified in the specs. This may or may not
    percent-decode (or do so partially), it depends on the spec. There
    are two different modes for that because the specs do not agree on
    how to normalize, and the consumer may want to rely on one or the
    other. Although the URI could infer which mode to use based on what
    parser was used. I don't know which is more useful.
    ComponentMode::PercentDecodedNormalized: This one is wrong if we
    have more than one normalization mode, but I think that we should
    at least have a mode that combines percent-decoding and normalization.
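
To make the first two modes concrete, here is a minimal sketch. Both the `ComponentMode` enum as written here and the `readComponent()` function are hypothetical: they only mimic the idea under discussion using `rawurldecode()`, and the normalized modes are left out because their exact behavior depends on the respective spec.

```php
// Hypothetical sketch: this enum and function do not exist in PHP or in the RFC text.
enum ComponentMode
{
    case Raw;
    case PercentDecoded;
}

// Returns a component either exactly as stored or with every %XX sequence decoded.
function readComponent(string $rawComponent, ComponentMode $mode): string
{
    return match ($mode) {
        ComponentMode::Raw => $rawComponent,
        ComponentMode::PercentDecoded => rawurldecode($rawComponent),
    };
}

$path = '/a%2Fb/caf%C3%A9';
echo readComponent($path, ComponentMode::Raw), "\n";            // /a%2Fb/caf%C3%A9
echo readComponent($path, ComponentMode::PercentDecoded), "\n"; // /a/b/café
```

The decoded output shows why the choice of mode matters: once `%2F` is decoded, it can no longer be told apart from a real path separator.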

I'm not yet sure I prefer this idea, and there are surely technical issues with it (as far as I can see now, doing so would require double the amount of memory that is currently needed for a single object). Of course, if we didn't have a common interface, then this would be much less of a problem... So getting rid of the interface would also be an option, because trying to align both specifications behind the same interface looks more and more difficult as I gain more insight into the edge cases. On the other hand, I'm not sure it's a good outcome that PHP users would have to explicitly choose whether their code uses RFC 3986 or WHATWG (and possibly have to convert URIs back and forth between the two specifications).

Regards,
Máté


Hi Máté,
Thanks for the thorough explanation of where the RFC stands at the moment.

My biggest takeaway from your explanation is that we are currently trying to unify something that IMHO cannot be unified.

RFC3986 is a parsing RFC whose goal is to lay down the foundation for other RFCs to validate scheme-specific URIs. It needs to be generic, with all the caveats that come from being a generic specification.

The WHATWG URL standard's goal is to almost always succeed, or to never fail, depending on how one sees it. The specification is geared toward parsing, validating and normalizing URLs. The goal is to allow the HTTP client to rescue most if not all HTTP calls, even those that were badly issued.

Since the two specifications have different end goals, they are bound to treat URIs differently, each as a means to its own specific goal. Creating a system which can, at the same time, satisfy the goals of servers (RFC3986) and clients (WHATWG) is impossible at the moment, IMHO. In the end we will displease one side or both, most certainly confuse the PHP developer, and provide an unnecessarily complex feature that no one will want to use.

IMHO the current unique interface is a premature optimization. RFC3986 should have its own interface or base class, and the same is true for the WHATWG URL.


Once this is clear, I think all the other issues are resolved:

- comparing both types of URL becomes meaningless (it should throw or always return false)
- comparing two URLs of the same type should no longer suffer from the encoding/decoding issues (you no longer need to deploy a complex and somewhat hard to debug/understand encoding system that does not exist in any other language).


```php
use Uri\Rfc3986Uri;
use Uri\WhatWgUri;

new Rfc3986Uri("http://example.com")->equals(new WhatWgUri("http://ExAMple.com"));

// should return false or throw

new Rfc3986Uri("http://example.com")->equals(new Rfc3986Uri("http://ExAMple.com"));

// should return true
```
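
For readers who want to try the same-type comparison today, here is a rough sketch of what I have in mind. It is only an approximation: `sameSpecEquals()` is a hypothetical helper, it leans on `parse_url()` (which is not RFC 3986 compliant), and it only lowercases the scheme and host, which RFC 3986 treats as case-insensitive. Userinfo, ports and percent-encoding are ignored for brevity.

```php
// Hypothetical helper; assumes absolute URIs that have both a scheme and a host.
function sameSpecEquals(string $a, string $b): bool
{
    $normalize = static function (string $uri): ?string {
        $parts = parse_url($uri);
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            return null; // not comparable in this simplified sketch
        }

        // Scheme and host are case-insensitive; the other components are kept as-is.
        return strtolower($parts['scheme']) . '://' . strtolower($parts['host'])
            . ($parts['path'] ?? '')
            . (isset($parts['query']) ? '?' . $parts['query'] : '')
            . (isset($parts['fragment']) ? '#' . $parts['fragment'] : '');
    };

    $left = $normalize($a);
    $right = $normalize($b);

    return $left !== null && $left === $right;
}

var_dump(sameSpecEquals('http://example.com', 'http://ExAMple.com')); // bool(true)
```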

Last but not least, one of the biggest issues with `parse_url` is that it lacks support for i18n domain names, and it seems that the current implementation of Rfc3986Uri does not support them either. I would expect a class that supports RFC3986/RFC3987 out of the box or that uses an enum to specify which RFC needs to be followed.


```php

new Uri\Rfc3986\Uri(uri: "path?query", base: "https://example.com", version: Uri\Ietf::Rfc3987);

```

What do you think? IMHO RFC3987 should be the default value, to allow most people on earth to safely use the URL without having to explicitly specify the RFC used for parsing. Again, the WHATWG URL does not suffer from this issue, but that's because it was built knowing about it, which was not the case for RFC3986 until RFC3987 was brought to light!


Best regards,
Ignace
