On 27-Mar-25 04:10, Daniel Stenberg via curl-library wrote:
Hi team,

The curl_url_get man page [3] says it *normalizes* retrieved URLs. Normalizing in this context means that curl does its best to return a single consistent representation of a URL even if you provide different variations as input. Normalizing helps apps to, for example, compare URLs or otherwise be more consistent.

This claim turned out to be false [1], as there are multiple details not normalized in the latest libcurl version, and I am working on a PR [2] to address the shortcomings.

Normalizing URLs is less straightforward than it may sound. A naive version would decode every URL part, then encode each again and put together a full URL using all the re-encoded pieces. This however would break URLs in multiple ways: for example, '/' would be encoded to %2F in the path part and '=' would be encoded into %3D in the query part, so it can't be done that simply. Every part more or less has its own set of properties and characters to take into account and treat specially. Not to mention that it is simply more work that requires several more memory allocations to get done, etc.

Also, a user might not need or want this normalization to be done. Maybe we need a flag to enable/disable it?

Before I complete this work and risk wasting time going down the wrong rabbit hole, let me know if you have any thoughts, opinions or feedback on this area.

[1] = https://github.com/curl/curl/issues/16829
[2] = https://github.com/curl/curl/pull/16841
[3] = https://curl.se/libcurl/c/curl_url_get.html
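To make the round-trip in question concrete, here is a minimal sketch using the public curl_url API (example.com is just a placeholder; whether %61 is decoded back to a plain 'a' is exactly one of the details the message says is not yet handled consistently, so the output depends on the libcurl version):

  #include <stdio.h>
  #include <curl/curl.h>

  int main(void)
  {
    /* Two spellings that ought to normalize to the same URL:
       mixed-case scheme and host, plus a %-escape ("%61" == 'a')
       that does not need to be encoded at all. */
    const char *inputs[] = {
      "HTTP://EXAMPLE.com/p%61th?q=1",
      "http://example.com/path?q=1",
    };

    for(int i = 0; i < 2; i++) {
      CURLU *h = curl_url();
      char *out = NULL;

      if(curl_url_set(h, CURLUPART_URL, inputs[i], 0) == CURLUE_OK &&
         curl_url_get(h, CURLUPART_URL, &out, 0) == CURLUE_OK)
        printf("in:  %s\nout: %s\n", inputs[i], out);

      curl_free(out);
      curl_url_cleanup(h);
    }
    return 0;
  }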
Be careful. %2F is not the same as / in all cases. The rules are messy enough that I won't restate them here, but refer to the RFCs, e.g. RFC 3986 to start.
Note in section 2.4 <https://datatracker.ietf.org/doc/html/rfc3986#section-2.4>:
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, *as otherwise the data may be mistaken for component delimiters*.
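To illustrate that rule with the '/' case from above: once a path is percent-decoded, an encoded %2F can no longer be told apart from a real segment delimiter. A minimal sketch (example.com is a placeholder):

  #include <stdio.h>
  #include <curl/curl.h>

  int main(void)
  {
    /* "a%2Fb" is ONE path segment containing a slash;
       "a/b" is TWO segments. */
    const char *urls[] = {
      "http://example.com/a%2Fb/c",
      "http://example.com/a/b/c",
    };

    for(int i = 0; i < 2; i++) {
      CURLU *h = curl_url();
      char *path = NULL;

      curl_url_set(h, CURLUPART_URL, urls[i], 0);

      /* Decoding the whole path makes both URLs yield the identical
         string "/a/b/c" - any later split on '/' then mistakes the
         data for a component delimiter. */
      if(curl_url_get(h, CURLUPART_PATH, &path, CURLU_URLDECODE) == CURLUE_OK)
        printf("%-28s -> decoded path: %s\n", urls[i], path);

      curl_free(path);
      curl_url_cleanup(h);
    }
    return 0;
  }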
Section 6.1 <https://datatracker.ietf.org/doc/html/rfc3986#section-6.1> discusses Equivalence (and normalization) in depth.
There are both generic and scheme-specific rules and considerations. The scheme RFCs have details.
About the only things that can easily and safely be normalized are the authority (when it's known to be a DNS name, and comparing embedded authorization to separate data), IP addresses, and the case of the hexadecimal digits a-f in percent-escapes.
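A sketch of just those safe steps, as hypothetical standalone helpers (neither function is part of libcurl): fold the hex digits of percent-escapes to one case, and lowercase a host known to be a DNS name, touching nothing else:

  #include <ctype.h>
  #include <stdio.h>

  /* Hypothetical helper: uppercase the two hex digits after each '%'.
     RFC 3986 section 6.2.2.1 makes %-escape hex case-insensitive, so
     "%3d" and "%3D" are equivalent and may be folded to one form. */
  static void norm_pct_hex(char *s)
  {
    for(; *s; s++) {
      if(s[0] == '%' && isxdigit((unsigned char)s[1]) &&
         isxdigit((unsigned char)s[2])) {
        s[1] = (char)toupper((unsigned char)s[1]);
        s[2] = (char)toupper((unsigned char)s[2]);
        s += 2;
      }
    }
  }

  /* Hypothetical helper: DNS names are case-insensitive, so a host
     known to be a DNS name may be lowercased. Do NOT apply this to
     other URL parts. */
  static void norm_dns_host(char *host)
  {
    for(; *host; host++)
      *host = (char)tolower((unsigned char)*host);
  }

  int main(void)
  {
    char path[] = "/a%2fb%3d";
    char host[] = "EXAMPLE.Com";

    norm_pct_hex(path);
    norm_dns_host(host);
    printf("%s%s\n", host, path);  /* example.com/a%2Fb%3D */
    return 0;
  }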
More can be done with scheme-specific knowledge. And even more if you have knowledge of the server (e.g. file systems that are case-insensitive/case-preserving can have case aliases, but case-sensitive file systems will not). I won't mention the server-side aliasing of links (hard and symbolic - and context-dependent)...
While "Do What I Mean" has it's place, there does need to be a mechanism for "Do exactly what I say". Even if the latter means that the user is outsmarting herself. Better that, than the library outsmarting the user with incorrect results.
I don't have the time to look into the man page, the current code, or your PRs. But you asked about "rabbit holes"; these are some of the entrances.
HTH.

Timothe Litt
ACM Distinguished Engineer
--------------------------
This communication may not represent the ACM or my employer's views, if any, on the matters discussed.