#D would most likely be the preferred way if it's possible. However, it sounds brittle, and the "works differently if not built with IDN support" part gives it the kind of "it depends" quality that one perhaps does not want from this API?
In essence, I think it boils down to the use case for extracting the URL. The caller supplied the URL in the first place, so it already knows it, most likely in the form the user prefers (assuming the caller got the URL from the user in some way). But then there is of course the need to see the URL after a redirect.

The beauty of #B is that even if the URL is potentially ugly, it's at least consistent and can be used for binary comparisons, something that would probably avoid some CVE in the calling software down the line.

So perhaps the better solution would be to always do #B and also add #E: a new option for extracting a "display friendly URL" that tries to do #D if built with IDN but falls back to #B or #A if not. Since it would be used for display only, some inconsistency should be more tolerable.

But then of course that means one more option, and the added risk that some caller will use this new option for URL comparisons anyway, so "complicated" as always...

/HH

On Sat, 9 Oct 2021 at 11:42, Daniel Stenberg via curl-library <curl-library@lists.haxx.se> wrote:

> Hello friends.
>
> Let me take you through a bug, my current work and the little dilemma I'm
> facing in regards to how to "canonicalize" host names in URLs! I'll end the
> mail with a question about a possible solution I've thought of.
>
> # Not parsing percent-encoded host names in URLs
>
>     $ curl https://%63url.se/
>     curl: (6) Could not resolve host: %63url.se
>
> instead of:
>
>     $ curl https://%63url.se/
>     [content from https://curl.se]
>
> Issue: https://github.com/curl/curl/issues/7830
> PR: https://github.com/curl/curl/pull/7834
>
> ## Obvious first take
>
> Make sure that the URL parser **decodes** percent-encoded host names. %41
> becomes `A` etc.
>
> The parser rejects "control codes" while decoding. %00, %0a and %0d make the
> host name illegal.
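
To make the decoding step above concrete, here is a minimal Python sketch of it. This is not libcurl's code: the function name `decode_host` is hypothetical, and treating every byte below 0x20 plus 0x7F as a "control code" is an assumption, since the mail only names %00, %0a and %0d.

```python
import re

# Matches one percent-encoded byte such as %63 or %C3.
_PCT = re.compile(rb"%([0-9A-Fa-f]{2})")


def decode_host(host: str) -> bytes:
    """Percent-decode a host name and reject control codes.

    Hypothetical helper for illustration only; the control-code range
    (bytes below 0x20, plus 0x7F) is an assumption.
    """
    raw = host.encode("utf-8")

    # A '%' that is not followed by two hex digits is malformed.
    if re.search(rb"%(?![0-9A-Fa-f]{2})", raw):
        raise ValueError("malformed percent-encoding in host name")

    # Replace each %XX with the single byte it stands for.
    decoded = _PCT.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

    # Reject control codes, including ones that arrived percent-encoded.
    if any(b < 0x20 or b == 0x7F for b in decoded):
        raise ValueError("control code in host name")

    return decoded
```

With this sketch, `decode_host("%63url.se")` returns `b"curl.se"`, while `decode_host("%00.se")` raises `ValueError`.
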
>
> ## Canonical host name
>
> The URL API can also *extract* the full URL so it needs to be able to reverse
> the process, and here the challenges begin.
>
> My first simplistic (or maybe *naive*) approach works like this:
>
> Setting `https://%63url.se/` is extracted again as `https://curl.se/` but
> setting `https://%c0.se/` is extracted as `https://%c0.se/` (since anything
> non-ASCII is not "URL compliant").
>
> ## IDN input
>
> Enter IDN. International Domain Names. They are specified outside of the
> regular URL spec (RFC 3986) and they are specified using non-ASCII byte
> codes.
>
> Example name: `räksmörgås.se` (clients puny-encode this name to
> `xn--rksmrgs-5wao1o.se` for DNS etc).
>
> Since this host/URL uses non-ASCII letters, the naive approach mentioned above
> would then, when the URL API is used to extract this again, use a sequence of
> percent-encoded UTF-8: `r%C3%A4ksm%C3%B6rg%C3%A5s.se`.
>
> It would **not** extract back to `räksmörgås.se`, which probably is what a
> user will expect.
>
> Next-level complication: mix in percent-encoding to the IDN name:
>
> `r%c3%a4ksmörgås.se`
>
> The two percent-encoded bytes are the UTF-8 sequence for `ä`, which makes this
> host name work the same way.
>
> ## IDN output
>
> How do we know how to encode the host name when the user wants to extract it?
>
> Alternatives I can think of:
>
> ### A) Don't
>
> Store the originally provided name and use that for retrieval as well. This
> is bad as then the same URL with differently encoded host names will appear
> as two different ones. Users probably will not expect nor appreciate this.
>
> ### B) Always
>
> Always percent-encode (this is what the PR currently does). It makes the host
> name canonical and it still works IDN wise, but the retrieved URL is ugly and
> user hostile.
>
> ### C) Puny-encode
>
> Return the **puny-encoded** version of the name if it was an IDN name,
> otherwise percent-encode. Makes the host name canonical, it still works IDN
> wise, but the retrieved URL is ugly and user hostile. Just possibly a little
> less hostile than version B. An upside could be that a puny-code version of
> the host name works even with clients that don't speak IDN. This method then
> works differently if libcurl was built with or without IDN support.
>
> ### D) Heuristics
>
> If the host name was a valid IDN name, then return that name without
> encoding, otherwise percent-encode. This makes `r%c3%a4ksmörgås.se` as input
> generate `räksmörgås.se` as output. This method then works differently if
> libcurl was built with or without IDN support.
>
> Can we make version (D) work and would that be preferred?
>
> --
>
>   / daniel.haxx.se
>   | Commercial curl support up to 24x7 is available!
>   | Private help, bug fixes, support, ports, new features
>   | https://curl.se/support.html
> --
> Unsubscribe: https://lists.haxx.se/listinfo/curl-library
> Etiquette:   https://curl.haxx.se/mail/etiquette.html
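
For what it's worth, the split suggested at the top of this reply (#B for the canonical extraction, plus a separate #E for display) could look roughly like the Python sketch below. Everything here is an assumption for illustration and not libcurl API: the names `canonical_host` and `display_host`, the `idn_support` flag standing in for "built with IDN", the choice of uppercase hex digits, and the use of Python's built-in `idna` codec (IDNA 2003) as the "is this a valid IDN name" check. Both functions take the already percent-decoded host bytes.

```python
def canonical_host(decoded: bytes) -> str:
    """#B: keep ASCII bytes as-is, percent-encode every non-ASCII byte.

    Uppercase hex is an arbitrary choice made here so that the output is
    stable and can be compared byte for byte.
    """
    return "".join(chr(b) if b < 0x80 else f"%{b:02X}" for b in decoded)


def display_host(decoded: bytes, idn_support: bool = True) -> str:
    """#E (hypothetical): display-friendly form, falling back to #B.

    Intended for showing to a user only, never for comparing URLs.
    """
    if idn_support:
        try:
            text = decoded.decode("utf-8")  # must be valid UTF-8
            text.encode("idna")             # must convert to punycode
            return text
        except UnicodeError:
            pass  # not a usable IDN name; fall through to #B
    return canonical_host(decoded)


host = "r\u00e4ksm\u00f6rg\u00e5s.se".encode("utf-8")
print(canonical_host(host))  # r%C3%A4ksm%C3%B6rg%C3%A5s.se
print(display_host(host) == "r\u00e4ksm\u00f6rg\u00e5s.se")  # True
print(display_host(b"\xc0.se"))  # %C0.se
```

The point of keeping the two functions separate is that `canonical_host` gives the same answer regardless of how the library was built, while only `display_host` carries the "it depends" behavior.
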