On Tuesday, 8 October 2013, 15:07:51, Giuseppe Scrivano wrote:
> Tim Rühsen <[email protected]> writes:
> > I added two links/urls to follow in index.html, now there are three in
> > total. All three links/urls point to the same host, but have different
> > host encodings (plain international text, punycoding, percent escaping).
> >
> > Wget should recognize these three codings as being the same and thus I
> > removed the -H (host spanning) option to verify that.
> >
> > Now, Wget fails this test, I guess it needs a fix.
> >
> > Regards, Tim
> >
> > From 2e6f527121497b3b148496a9a9c774451d2e0017 Mon Sep 17 00:00:00 2001
> > From: Tim Ruehsen <[email protected]>
> > Date: Mon, 7 Oct 2013 23:37:42 +0200
> > Subject: [PATCH] improved Test-idn-robots.px
> >
> > ---
> >
> >  tests/ChangeLog          |  5 +++++
> >  tests/Test-idn-robots.px | 27 ++++++++++++++++++++++++++-
> >  2 files changed, 31 insertions(+), 1 deletion(-)
>
> thanks for your test. The IRI support is a bit of a mess and I am not
> sure how this issue should be fixed:
>
> Should we check if the two domains are the same in recur.c (somewhere
> near line 633)? It means that we will need to check there for
> different encodings and convert among them. Another solution would be
> that append_url stores the url in a specific format.
>
> Probably the latter solution allows us to also deal with page specific
> locales when it is specified.
>
> Have you already looked into this issue? Do you have any
> idea/suggestion?

I already solved this issue in Mget, an experimental tool in which I moved the
URI/IRI parser into a library. I can offer to contribute code from those
sources to Wget/FSF. Perhaps you could take a look and see what fits Wget
(since Mget does the same job as Wget, it should fit).
The code for mget_iri_parse() is in
https://github.com/rockdaboot/mget/blob/master/libmget/iri.c
Mget 'normalizes' all URIs/IRIs by
- decoding the percent encoding
- encoding to UTF-8
- parsing into host/path/query etc.
- encoding the host to its ASCII form with toASCII() (libidn2+libunistring
  or libidn) via mget_str_to_ascii(iri->host)
From then on, this ASCII form is used as the host name for directories, DNS,
HTTP, comparisons etc.
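
To illustrate why this normalization makes the three host encodings from the
test compare equal, here is a minimal sketch in Python. It is not the Mget
code: ascii_host() and the host name bücher.example are made up for this
example, Python's built-in "idna" codec (IDNA 2003) stands in for toASCII(),
and the sketch percent-decodes only the host component after parsing instead
of decoding the whole URL first.

```python
from urllib.parse import urlsplit, unquote


def ascii_host(url: str) -> str:
    """Return the host of `url` in its ASCII (punycode) form.

    Hypothetical helper for illustration only; it is not part of Wget or Mget.
    """
    # Parse the URL into components; .hostname is already lowercased.
    host = urlsplit(url).hostname or ""
    # Undo percent escaping in the host, interpreting the bytes as UTF-8.
    host = unquote(host, encoding="utf-8", errors="strict")
    # Convert each label to its ASCII form (stand-in for toASCII()).
    return host.encode("idna").decode("ascii")


# The same host written in three different encodings.
urls = [
    "http://bücher.example/robots.txt",         # plain international text
    "http://xn--bcher-kva.example/robots.txt",  # punycode
    "http://b%C3%BCcher.example/robots.txt",    # percent escaping
]

for url in urls:
    print(ascii_host(url))
```

With hosts stored in this single form, a same-host check such as the one in
recur.c would reduce to a plain string comparison.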
If I can lend a hand, contact me.
Regards, Tim
