Hi Dennis,
> This is a late thought, and surely amenable to a later RFC, but I was > thinking about the get/set path methods and the issue of the / and %2F. > > - If we exposed `getPathIterator()` or `getPathSegments()` could we not > report these in their fully-decoded forms? That is, because the path > segments are separated by some invocation or array element, they could be > decoded? > - Probably more valuably, if `withPath()` accepted an array, could we not > allow fully non-escaped PHP strings as path segments which the URL class > could safely and by-default handle the escaping for the caller? > Yes, these are very good ideas, and actually they are in line with how I would imagine a second iteration. Probably, getPathSegments() could return the "%2F" (percent-encoded form of "/") percent-decoded, sure. But the rest of the reserved characters will also be an issue, since they can also appear within the path (i.e. "&" inside "Document & Settings" etc.) percent-encoded. So percent decoding of reserved characters should still be taken into account. Right now, if someone haphazardly joins path segments in order to set > `withPath()` they will likely be unaware of that nuance and get the path > wrong. On the grand scale of things, I suspect this is a really minor risk. > However, if they could send in an array then they would never need to be > aware of that nuance in order to provide a fully-reliable URL, up to the > class rejecting path segments which cannot be represented. > Yes, consuming an array is also a good idea, but for the same reason as above, it's not enough to take care of correctly percent-encoding "/" in order to have a valid URI as a result. (Of course I'm still talking about RFC 3986, WHATWG still performs automatic percent-encoding) > > The HTML5 library has `::createFromString()` instead of `parse()`. Did you > consider following this form? It doesn’t seem that important, but could be > a nice improvement in consistency among the newer spec-compliant APIs. > Further, I think `createFromString()` is a little more obvious in intent, > as `parse()` is so generic. > > Given the issues around equivalence, what about `isEquivalent()` instead > of `equals()`? In the RFC I think you have been careful to use the > “equivalence” terminology, but then in the actual interface we fall back to > `equals()` and lose some of the nuance. > In my implementation, I tried to choose terminology that people are familiar with instead of using the technicus terminus of URIs. Instead of recompose(), I used toString() (or some variant of it), instead of isEquivalent(), I used equals(). Parse() is probably an outlier, since it's the correct name of the exact process. But in any case, I consider these names adequately short, and I think they very clearly convey their intent. Using the technicus terminus would probably even more suit those who have deep familiarity with URIs, but this group will likely be the minority forever. For the rest of the people, the current names make more sense, so I'd prefer keeping them as-is. > > Something about not implementing `getRawScheme()` and friends in the > WHATWG class seems off. Your rationale makes sense, but then I wonder what > the problem is in exposing the raw untranslated components, particularly > since the “raw” part of the name already suggests some kind of danger or > risk in using it as some semantic piece. > Hm, interesting remark. Do I understand correctly that you are suggesting to expose getRawScheme() and getRawHost() with their original value? If so, then this has technical challenges: the WHATWG parser doesn't store the original value of these two components, so they are effectively lost when automatically transformation happens during parsing. But this is normal, since the WHATWG specification doesn't really care about the original value of these components. > Tim brought up the naming of `getHost()` and `getHostForDisplay()` as well > as the correspondence with the `toString()` methods. I’m not sure if it was > overlooked or I missed the followup, but I wonder what your thoughts are on > passing an enum to these methods indicating the rendering context. Here’s > why: I see developers reach for the first method that looks right. In this > case, that would almost always be `getHost()`, yet `getHost()` or > `toString()` or whatever is going to be inappropriate in many common cases. > I see two ways of baking in education into the API surface: creating two > symmetric methods (e.g. `getDisplayableHost()` and > `getNonDisplayableHost()`); or requiring an enum forcing the choice (e.g. > `getHost( ForDisplay | ForNonDisplay )`). In the case on an enum this could > be equally applied across all of the relevant methods where such a > distinction exists. On one hand this could be seen as forcing callers to > make a choice, but on the other hand it can also be seen as a safeguard > against an extremely-common foot-gun, making such an easy oversight > impossible. > I am myself also a bit lost on the countless names that I tried out in the implementation, but I think I had toHumanFriendlyString() and toDisplayFriendlyString() methods at some point. These then ended up being toString() and toDisplayString() after some iterations. I would be ok with renaming getHost() and toString() so that their names suggest they don't use IDNA, but I'd clearly need a good enough suggestion, since neither "MachineFriendly", nor "NonDisplayable" sound like the best alternative for me. I was also considering using getIdnaHost() and toIdnaString(), but I realized these are the worst looking names I have come up with so far. > > Just this week I stumbled upon an issue with escaping the hash/fragment > part of a URL. I think that browsers used to decode percent-encodings in > the fragment but they all stopped and this was removed from the WHATWG HTML > spec [no-percent-escaping]. The RFC currently shows `getFragment()` > decoding percent-encoded fragments, However, I believe that the WHATWG URL > spec only indicates percent-encoding when _setting_ the fragment. You can > test this in a browser with the following example: Chrome, Firefox, and > Safari exhibit the same behavior. > > u = new URL(window.location) > u.hash = ‘one and two’; > u.hash === ‘#one%20and%20two’; > u.toString() === ‘….#one%20and%20two’; > > So I think it may be more accurate and consistent to handle > `Whatwg\Url::getFragment` in the same way as `getScheme()`. When setting a > fragment we should percent-encode the appropriate characters, but when > reading it, we should never interpret those characters — it should always > return the “raw” value of the fragment. > > [no-percent-escaping]: https://github.com/whatwg/url/issues/344 > > Thank you for the suggestion and for noticing this problem. I believe you must have read a version of the RFC where I was still trying to find out the correct percent-decoding rules for WHATWG. At some point, I was completely misunderstanding what the specification prescribed, so I had to make quite some changes in the RFC regarding this aspect + finally I managed to describe elaborately the reasoning behind the choices. Now I think the rules make sense. Yes, my implementation automatically percent-encodes the input when parsing or modifying a WHATWG URL. You are also right that WHATWG never percent-decodes the output due to the following reason: ... the point of view of a maintainer of the WHATWG specification is that > webservers may legitimately choose to consider encoded and decoded paths > distinct, and a standard cannot force them not to do so. The said author made this clear in multiple comments, but this one is linked in the RFC: https://github.com/whatwg/url/issues/606#issuecomment-926395864 So basically all the non-raw getters return a value that is considered by WHATWG non-equivalent with the original input. This is also explained in the "Component retrieval" section in more detail now ( https://wiki.php.net/rfc/url_parsing_api#component_retrieval). I hope Regards, Máté