> From: Tim Ruehsen <[email protected]>
> Cc: Eli Zaretskii <[email protected]>
> Date: Tue, 15 Dec 2015 11:02:21 +0100
>
> I pushed a conversion fix to master.

Thanks!

> There is another bug in wget that comes out with
>   wget -d --local-encoding=cp1255 'http://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4'
>
> Wget double escapes/converts to UTF-8... Maybe you can address this when you
> are working on the code !?

You mean, because http redirects to https?  Yes, I've seen that
already.  The simple patch below fixes that.

The problem seems to be that wget assumes the redirected URL to be
encoded in the same encoding as the original one (which, as described
earlier, starts with the local encoding), whereas it is much more
reasonable to use the value provided by --remote-encoding.

And if the 'if' in the patch looks strange to you, it's rightfully so.
Look at this strange logic in set_uri_encoding:

  /* Set uri_encoding of struct iri i. If a remote encoding was specified, use
     it unless force is true. */
  void
  set_uri_encoding (struct iri *i, const char *charset, bool force)
  {
    DEBUGP (("URI encoding = %s\n", charset ? quote (charset) : "None"));
    if (!force && opt.encoding_remote)
      return;

I understand the reason to prefer opt.encoding_remote when the 'force'
flag is false -- the user-provided remote encoding should take
precedence.  But why return without making sure the URI's encoding is
in fact set to that??  I guess there's some assumption that
iri->uri_encoding is already set to opt.encoding_remote, but this
assumption is certainly false in this case.

So I think this function should be changed to actually use
opt.encoding_remote, if non-NULL, and otherwise use 'charset' even if
'force' is false.  Then the patch below could be simplified to avoid
the test.  WDYT?

Here's the patch I promised.  With it, wget survives redirection from
http to https and successfully retrieves that page.

diff --git a/src/retr.c b/src/retr.c
index a6a9bd7..6af26a0 100644
--- a/src/retr.c
+++ b/src/retr.c
@@ -872,9 +872,11 @@ retrieve_url (struct url * orig_parsed, const char *origurl, char **file,
       xfree (mynewloc);
       mynewloc = construced_newloc;
 
-      /* Reset UTF-8 encoding state, keep the URI encoding and reset
+      /* Reset UTF-8 encoding state, set the URI encoding and reset
          the content encoding. */
       iri->utf8_encode = opt.enable_iri;
+      if (opt.encoding_remote)
+        set_uri_encoding (iri, opt.encoding_remote, true);
       set_content_encoding (iri, NULL);
 
       xfree (iri->orig_url);
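
To make the set_uri_encoding suggestion above more concrete, here is a
rough stand-alone sketch of the behavior I have in mind.  It is not
wget code: 'struct iri' and 'opt' are cut-down stand-ins with only the
fields involved, set_uri_encoding_sketch is a made-up name, and plain
malloc/free replace wget's own allocation helpers.  The only point is
the order of the decisions: pick opt.encoding_remote when it is set and
'force' is false, otherwise pick 'charset', and in both cases really
store the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for wget's struct iri and global options;
   only the fields this sketch touches are modelled.  */
struct iri
{
  char *uri_encoding;
};

struct options
{
  const char *encoding_remote;  /* value of --remote-encoding, or NULL */
};

struct options opt;

/* Hypothetical variant of set_uri_encoding.  Unless FORCE is true, a
   user-supplied remote encoding takes precedence over CHARSET, and the
   field is always updated instead of returning early.  */
void
set_uri_encoding_sketch (struct iri *i, const char *charset, bool force)
{
  const char *chosen = charset;
  char *copy = NULL;

  assert (i != NULL);

  if (!force && opt.encoding_remote)
    chosen = opt.encoding_remote;

  if (chosen)
    {
      size_t len = strlen (chosen) + 1;
      copy = malloc (len);
      if (!copy)
        abort ();               /* out of memory: give up */
      memcpy (copy, chosen, len);
    }

  /* Copy first, free afterwards, so CHARSET may alias the old value.  */
  free (i->uri_encoding);
  i->uri_encoding = copy;
}
```

With this shape, a non-forced call can no longer leave a stale
encoding in place when --remote-encoding was given, which is what
bites us in the redirect case.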
