Re: A canonical URL host name dilemma

Henrik Holst via curl-library Sat, 09 Oct 2021 03:24:39 -0700

#D would most likely be the preferred way if it's possible. However, it
sounds brittle, and the "works differently if not built with IDN support"
part gives it the kind of "it depends" quality that one perhaps does not
want from this API?


In essence I think it boils down to the use case for extracting the URL.
The caller supplied the URL in the first place, so it should already know
it, most likely in the form the user prefers (I'm thinking that the caller
got the URL from the user in some way, so it should already be in the
preferred form). But then there is of course the need to see the URL after
a redirect.

The beauty of #B is that even if the URI is potentially ugly, it's at least
consistent and can be used for binary comparisons, something that would
probably avoid some CVE in the calling software down the line.
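
To illustrate the "consistent, usable for binary comparisons" point, here is
a minimal sketch in Python (not libcurl code). The helper name
`canonical_host` is made up for this example, and rejecting every byte below
0x20 is my assumption about what "control codes" covers:

```python
from urllib.parse import quote, unquote_to_bytes


def canonical_host(host: str) -> str:
    """Hypothetical helper: decode a host name, then always percent-encode it."""
    # Decode any %XX sequences; literal non-ASCII characters become UTF-8 bytes.
    raw = unquote_to_bytes(host.encode("utf-8"))
    # Assumption: treat every byte below 0x20 as a rejected control code.
    if any(b < 0x20 for b in raw):
        raise ValueError("control code in host name")
    # Re-encode everything outside the unreserved set, as in alternative #B.
    return quote(raw, safe="")


# Differently encoded spellings of the same name compare equal byte for byte.
same = canonical_host("r%c3%a4ksmörgås.se") == canonical_host("räksmörgås.se")
```

With this, a caller can compare two host names with a plain string compare
without caring how each one was originally written.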

So perhaps the better solution would be to always do #B and then also have
#E: a new option for extracting a "display friendly URL" that tries to do
#D if built with IDN but falls back to #B or #A if not. Since it would be
used for display only, some inconsistency should be more tolerable. But
then of course that means one more option, and the added risk that some
caller will use this new option for URL comparisons anyway, so "complicated"
as always...
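
A rough sketch of what such a display-only #E could look like, again in
Python rather than libcurl code. The name `display_host` and the
`idn_enabled` flag (standing in for "built with IDN support") are
hypothetical, and Python's built-in IDNA 2003 codec is only used here as a
stand-in validity check:

```python
from urllib.parse import quote, unquote_to_bytes


def display_host(host: str, idn_enabled: bool = True) -> str:
    """Hypothetical helper: return a display-friendly host name, else fall back."""
    raw = unquote_to_bytes(host.encode("utf-8"))
    # Assumption: treat every byte below 0x20 as a rejected control code.
    if any(b < 0x20 for b in raw):
        raise ValueError("control code in host name")
    if idn_enabled:
        try:
            name = raw.decode("utf-8")  # must be valid UTF-8
            name.encode("idna")         # must be acceptable as an IDN name
            return name                 # #D: show the name unencoded
        except UnicodeError:
            pass                        # not a valid IDN name, fall back
    # Fallback to #B: percent-encode everything outside the unreserved set.
    return quote(raw, safe="")
```

The result depends on `idn_enabled`, which is exactly the inconsistency
mentioned above, so this output should only ever be shown to a user and
never used for comparisons.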

/HH

On Sat, 9 Oct 2021 at 11:42, Daniel Stenberg via curl-library <
curl-library@lists.haxx.se> wrote:

> Hello friends.
>
> Let me take you through a bug, my current work and the little dilemma I'm
> facing in regards to how to "canonicalize" host names in URLs! I'll end the
> mail with a question about a possible solution I've thought of.
>
> # Not parsing percent-encoded host names in URLs
>
>      $ curl https://%63url.se/
>      curl: (6) Could not resolve host: %63url.se
>
> instead of:
>
>      $ curl https://%63url.se/
>      [content from https://curl.se]
>
> Issue: https://github.com/curl/curl/issues/7830
> PR: https://github.com/curl/curl/pull/7834
>
> ## Obvious first take
>
>   Make sure that the URL parser **decodes** percent-encoded host names. %41
>   becomes `A` etc.
>
>   The parser rejects "control codes" while decoding. %00, %0a and %0d make
>   the host name illegal.
>
> ## Canonical host name
>
>   The URL API can also *extract* the full URL so it needs to be able to
>   reverse the process, and here begin the challenges.
>
>   My first simplistic (or maybe *naive*) approach works like this:
>
>   Setting `https://%63url.se/` is extracted again as `https://curl.se/` but
>   setting `https://%c0.se/` is extracted as `https://%c0.se/` (since anything
>   non-ASCII is not "URL compliant").
>
> ## IDN input
>
>   Enter IDN. International Domain Names. They are specified outside of the
>   regular URL spec (RFC 3986) and they are specified using non-ASCII byte
>   codes.
>
>   Example name: `räksmörgås.se` (clients puny-encode this name to
>   `xn--rksmrgs-5wao1o.se` for DNS etc).
>
>   Since this host/URL uses non-ASCII letters, the naive approach mentioned
>   above would then, when the URL API is used to extract this again, use a
>   sequence of percent-encoded UTF-8: `r%C3%A4ksm%C3%B6rg%C3%A5s.se`.
>
>   It would **not** extract back to `räksmörgås.se`, which probably is what a
>   user will expect.
>
>   Next-level complication: mix in percent-encoding to the IDN name:
>
>   `r%c3%a4ksmörgås.se`
>
>   The two percent-encoded bytes are the UTF-8 sequence for `ä`, which makes
>   this host name work the same way.
>
> ## IDN output
>
>   How do we know how to encode the host name when the user wants to extract
>   it?
>
>   Alternatives I can think of:
>
> ### A) Don't
>
>   Store the originally provided name and use that for retrieval as well. This
>   is bad as then the same URL with differently encoded host names will appear
>   as two different ones. Users probably will not expect nor appreciate this.
>
> ### B) Always
>
>   Always percent-encode (this is what the PR currently does). It makes the
>   host name canonical and it still works IDN wise, but the retrieved URL is
>   ugly and user hostile.
>
> ### C) Puny-encode
>
>   Return the **puny-encoded** version of the name if it was an IDN name,
>   otherwise percent-encode. Makes the host name canonical, it still works IDN
>   wise, but the retrieved URL is ugly and user hostile. Just possibly a little
>   less hostile than version B. An upside could be that a puny-code version of
>   the host name works even with clients that don't speak IDN. This method then
>   works differently if libcurl was built with or without IDN support.
>
> ### D) Heuristics
>
>   If the host name was a valid IDN name, then return that name without
>   encoding, otherwise percent-encode. This makes `r%c3%a4ksmörgås.se` as input
>   generate `räksmörgås.se` as output. This method then works differently if
>   libcurl was built with or without IDN support.
>
>
>
> Can we make version (D) work and would that be preferred?
>
> --
>
>   / daniel.haxx.se
>   | Commercial curl support up to 24x7 is available!
>   | Private help, bug fixes, support, ports, new features
>   | https://curl.se/support.html
> --
> Unsubscribe: https://lists.haxx.se/listinfo/curl-library
> Etiquette:   https://curl.haxx.se/mail/etiquette.html
>
