On 2/14/19 12:25 PM, Andres Valloud wrote:
> Tim,
>
> On 2/14/19 02:03, Tim Rühsen wrote:
>>> I looked at the downloaded html files with grep. They do contain the
>>> substring "1f43", seemingly after a ^M character (I did not check every
>>> single occurrence). Sometimes, the ^M character is within a file name
>>> such as this:
>>>
>>> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
>>> 1f43^M
>>> "
>>
>> If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
>> correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
>> End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers simply
>> ignore it. This is nothing that can be addressed with
>> --restrict-file-names.
>>
>> But to make sure, look at the original file by downloading it with 'wget
>> <URL>'. Does the file have the above '1f43'/^M stuff in it as well? If
>> so, we can't do much about it.
>>
>> If all looks ok in there, please attach both files so we can compare and
>> possibly reproduce.
>>
>> If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
>> x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
>> request is coming via Firefox.
>> curl and wget both have the --user-agent option for this.
>>
>> Do you get a different file when using that option?
>
> There was one additional detail to make this work. Instead of placing a
> request for index.html, I had to ask curl to get just the directory name
> ending with a slash. Then the server responded with (essentially)
> index.html.

A web server might give different content for 'dir', 'dir/' and
'dir/index.html'. This is sometimes puzzling and, as you can see, 'dir/'
can't be used as a filename, so we use 'dir/index.html' for that. That
is not correct if the server serves 'dir/index.php' when we request
'dir/'.

>
> Both curl and wget retrieve index.html contents without '1f43' when
> asking for just that URL. vimdiff says the retrieved files are identical.

Try starting with this URL using your original wget command line. You
could add a quota (-Q) to limit the amount of data, in the hope of
reproducing your issue with far fewer files and far less data to
download.

> I am at a loss as to how to explain how the '1f43' problem appears when
> asking wget to update the mirror of the site (rather than downloading a
> single file). I'll look at the log file tomorrow and see if I get more
> ideas.

Try to reduce the amount of data needed to reproduce it.

Regards, Tim
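
When reducing the mirror to a small reproducer, it may help to check each downloaded file for the pattern quoted above (a ^M, a short hex string such as '1f43', another ^M) instead of grepping by hand. The following is only a minimal sketch in Python, not part of wget or curl. The function names and the regular expression are assumptions made for illustration, and the pattern will also match any legitimate line that happens to consist solely of hex digits.

```python
import re

# Assumed shape of the stray token: CR LF, 1-8 hex digits, then CR LF.
# The trailing CR LF is a lookahead so back-to-back tokens are all found.
STRAY_TOKEN = re.compile(rb"\r\n([0-9a-fA-F]{1,8})(?=\r\n)")


def find_stray_tokens(data):
    """Return a list of (byte offset, token) for each CRLF-delimited hex run."""
    return [
        (match.start(), match.group(1).decode("ascii"))
        for match in STRAY_TOKEN.finditer(data)
    ]


def scan_file(path):
    """Read a file in binary mode, so the CR bytes survive, and scan it."""
    with open(path, "rb") as handle:
        return find_stray_tokens(handle.read())
```

Calling scan_file() on the copy from the mirror run and on the copy fetched as a single URL should show whether only the mirrored copy carries the token, and at which byte offsets.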