[bug #60287] Windows recursive download escapes utf8 URLs twice

Eli Zaretskii Fri, 26 Mar 2021 23:44:03 -0700

Follow-up Comment #8, bug #60287 (project wget):

> Is this because wget first downloads the html file and then reads the
> contents off disk


No.  It's because Wget downloads the pages you told it to, and saves them as
disk files.  Any links in the downloaded pages that lead to other pages
produce additional disk files (e.g., if you told Wget to download
recursively).

IOW, the file-name encoding issue happens when a Web page needs to be saved to
a file for some reason.

> If the bytes were downloaded with the correct encoding, and written to the
> file system with the correct encoding, I would expect it to be able to parse
> the file with the correct encoding.

What is the "correct encoding", though?
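
To illustrate why that question matters, here is a minimal Python sketch (not
Wget code). The byte string and the two codecs are assumptions chosen for
illustration: the same two bytes give different text depending on which
encoding the reader assumes.

```python
# Illustrative only: the byte values and the choice of codecs are assumptions,
# not something taken from the bug report.
raw = b"\xc3\xa9"  # the UTF-8 encoding of U+00E9 (e with acute accent)

as_utf8 = raw.decode("utf-8")     # one character: U+00E9
as_cp1252 = raw.decode("cp1252")  # two characters: U+00C3, U+00A9

print(len(as_utf8), len(as_cp1252))
```

Neither result is wrong as far as the bytes are concerned; which one is
"correct" depends on what the producer of the bytes intended.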

> the file `wget-test.html` has no non-ascii characters in it

Of course, it doesn't: the non-ASCII characters appear only when we decode the
hex-encoded (%XX) bytes in the URLs.
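
A minimal Python sketch of that point, using the standard urllib.parse module
rather than anything from Wget; the file name is hypothetical. The escaped
form is pure ASCII, decoding it yields the non-ASCII character, and escaping
an already-escaped string turns each "%" into "%25", which is the kind of
double escaping the bug title describes.

```python
from urllib.parse import quote, unquote

# Hypothetical link target, chosen only for illustration.
name = "\u00e9.html"

escaped_once = quote(name)            # UTF-8 bytes written as %XX escapes
decoded = unquote(escaped_once)       # back to the original text
escaped_twice = quote(escaped_once)   # the '%' signs themselves get escaped

print(escaped_once)            # %C3%A9.html
print(escaped_once.isascii())  # True
print(escaped_twice)           # %25C3%25A9.html
```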



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60287>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
