Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Máté Kocsis Sat, 15 Mar 2025 15:07:01 -0700

Hi Dennis,


> This is a late thought, and surely amenable to a later RFC, but I was
> thinking about the get/set path methods and the issue of the / and %2F.
>
>  - If we exposed `getPathIterator()` or `getPathSegments()` could we not
> report these in their fully-decoded forms? That is, because the path
> segments are separated by some invocation or array element, they could be
> decoded?
>  - Probably more valuably, if `withPath()` accepted an array, could we not
> allow fully non-escaped PHP strings as path segments which the URL class
> could safely and by-default handle the escaping for the caller?
>

Yes, these are very good ideas, and actually they are in line with how I
would imagine a second iteration. Probably, getPathSegments() could return
the "%2F" (percent-encoded form of "/") percent-decoded, sure. But the rest
of the reserved characters will also be an issue, since they can also appear
within the path (i.e. "&" inside "Document & Settings" etc.)
percent-encoded. So percent decoding of reserved characters should still be
taken into account.

Right now, if someone haphazardly joins path segments in order to set
> `withPath()` they will likely be unaware of that nuance and get the path
> wrong. On the grand scale of things, I suspect this is a really minor risk.
> However, if they could send in an array then they would never need to be
> aware of that nuance in order to provide a fully-reliable URL, up to the
> class rejecting path segments which cannot be represented.
>

Yes, consuming an array is also a good idea, but for the same reason as
above, it's not enough to take care of correctly percent-encoding "/" in
order
to have a valid URI as a result.  (Of course I'm still talking about RFC
3986, WHATWG still performs automatic percent-encoding)


>
> The HTML5 library has `::createFromString()` instead of `parse()`. Did you
> consider following this form? It doesn’t seem that important, but could be
> a nice improvement in consistency among the newer spec-compliant APIs.
> Further, I think `createFromString()` is a little more obvious in intent,
> as `parse()` is so generic.
>
> Given the issues around equivalence, what about `isEquivalent()` instead
> of `equals()`? In the RFC I think you have been careful to use the
> “equivalence” terminology, but then in the actual interface we fall back to
> `equals()` and lose some of the nuance.
>

In my implementation, I tried to choose terminology that people are
familiar with instead of using the technicus terminus of URIs. Instead of
recompose(), I used toString() (or some variant of it), instead of
isEquivalent(),
I used equals(). Parse() is probably an outlier, since it's the
correct name of the exact process. But in any case, I consider these names
adequately short, and I think they very clearly convey their intent. Using
the technicus
terminus would probably even more suit those who have deep familiarity with
URIs, but this group will likely be the minority forever. For the rest of
the people, the current names make more sense, so I'd prefer keeping them
as-is.


>
> Something about not implementing `getRawScheme()` and friends in the
> WHATWG class seems off. Your rationale makes sense, but then I wonder what
> the problem is in exposing the raw untranslated components, particularly
> since the “raw” part of the name already suggests some kind of danger or
> risk in using it as some semantic piece.
>

Hm, interesting remark. Do I understand correctly that you are suggesting
to expose getRawScheme() and getRawHost() with their original value? If so,
then this has technical challenges: the WHATWG parser doesn't store
the original value of these two components, so they are effectively lost
when automatically transformation happens during parsing. But this is
normal, since the WHATWG specification doesn't really care about the
original value of these components.


> Tim brought up the naming of `getHost()` and `getHostForDisplay()` as well
> as the correspondence with the `toString()` methods. I’m not sure if it was
> overlooked or I missed the followup, but I wonder what your thoughts are on
> passing an enum to these methods indicating the rendering context. Here’s
> why: I see developers reach for the first method that looks right. In this
> case, that would almost always be `getHost()`, yet `getHost()` or
> `toString()` or whatever is going to be inappropriate in many common cases.
> I see two ways of baking in education into the API surface: creating two
> symmetric methods (e.g. `getDisplayableHost()` and
> `getNonDisplayableHost()`); or requiring an enum forcing the choice (e.g.
> `getHost( ForDisplay | ForNonDisplay )`). In the case on an enum this could
> be equally applied across all of the relevant methods where such a
> distinction exists. On one hand this could be seen as forcing callers to
> make a choice, but on the other hand it can also be seen as a safeguard
> against an extremely-common foot-gun, making such an easy oversight
> impossible.
>

I am myself also a bit lost on the countless names that I tried out in the
implementation, but I think I had toHumanFriendlyString() and
toDisplayFriendlyString() methods at some point. These then ended up being
toString() and toDisplayString() after some iterations. I would be ok with
renaming getHost() and toString() so that their names suggest they don't
use IDNA, but I'd clearly need a good enough suggestion, since neither
"MachineFriendly", nor "NonDisplayable" sound like the best alternative for
me. I was also considering using getIdnaHost() and toIdnaString(), but I
realized these are the worst looking names I have come up with so far.


>
> Just this week I stumbled upon an issue with escaping the hash/fragment
> part of a URL. I think that browsers used to decode percent-encodings in
> the fragment but they all stopped and this was removed from the WHATWG HTML
> spec [no-percent-escaping]. The RFC currently shows `getFragment()`
> decoding percent-encoded fragments, However, I believe that the WHATWG URL
> spec only indicates percent-encoding when _setting_ the fragment. You can
> test this in a browser with the following example: Chrome, Firefox, and
> Safari exhibit the same behavior.
>
>     u = new URL(window.location)
>     u.hash = ‘one and two’;
>     u.hash === ‘#one%20and%20two’;
>     u.toString() === ‘….#one%20and%20two’;
>
> So I think it may be more accurate and consistent to handle
> `Whatwg\Url::getFragment` in the same way as `getScheme()`. When setting a
> fragment we should percent-encode the appropriate characters, but when
> reading it, we should never interpret those characters — it should always
> return the “raw” value of the fragment.
>
> [no-percent-escaping]: https://github.com/whatwg/url/issues/344
>
>
 Thank you for the suggestion and for noticing this problem. I believe you
must have read a version of the RFC where I was still trying to find out
the correct percent-decoding rules for WHATWG. At some point, I was
completely misunderstanding what the specification prescribed, so I had to
make quite some changes in the RFC regarding this aspect + finally I
managed to describe elaborately the reasoning behind the choices. Now I
think the rules make sense.

Yes, my implementation automatically percent-encodes the input when parsing
or modifying a WHATWG URL. You are also right that WHATWG never
percent-decodes the output due to the following reason:

... the point of view of a maintainer of the WHATWG specification is that
> webservers may legitimately choose to consider encoded and decoded paths
> distinct, and a standard cannot force them not to do so.


The said author made this clear in multiple comments, but this one is
linked in the RFC:
https://github.com/whatwg/url/issues/606#issuecomment-926395864

So basically all the non-raw getters return a value that is considered by
WHATWG non-equivalent with the original input. This is also explained in
the "Component retrieval" section in more detail now (
https://wiki.php.net/rfc/url_parsing_api#component_retrieval). I hope

 Regards,
Máté

Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Reply via email to