On 16.02.19 23:02, Andres Valloud wrote:
> Tim,
>
> I limited the data from 99gb to 3.3gb, and just to the directory where
> I've seen the problem occurs. The strange string '1f43' appears in this
> limited setup. The '1f43' substring seems to appear deterministically
> depending on the file name (I have not checked *every* occurrence by hand).
>
> How should I track this down?

I'd use -d -o log and leave out -k. If 1f43 still appears, we know it's
not because of wget's parsing or conversion. In this case it's from the
server: check in which file 1f43 appears and find the request in the log
file. Then try to download that file with a single (non-recursive) wget
command. Check if 1f43 appears in there. If it doesn't, compare both
requests to see the difference.

Let us know the results.

Regards, Tim

>
> Andres.
>
> On 2/14/19 04:03, Tim Rühsen wrote:
>> On 2/14/19 12:25 PM, Andres Valloud wrote:
>>> Tim,
>>>
>>> On 2/14/19 02:03, Tim Rühsen wrote:
>>>>> I looked at the downloaded html files with grep. They do contain the
>>>>> substring "1f43", seemingly after a ^M character (I did not check
>>>>> every single occurrence). Sometimes, the ^M character is within a
>>>>> file name such as this:
>>>>>
>>>>> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
>>>>> 1f43^M
>>>>> "
>>>>
>>>> If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
>>>> correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
>>>> End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers
>>>> simply ignore it. This is nothing that can be addressed with
>>>> --restrict-file-names.
>>>>
>>>> But to make sure, look at the original file by downloading it with
>>>> 'wget <URL>'. Does the file have the above '1f43'/^M stuff in it as
>>>> well? If so, we can't do much about it.
>>>>
>>>> If all looks ok in there, please attach both files so we can compare
>>>> and possibly reproduce.
>>>>
>>>> If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
>>>> x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
>>>> request is coming via Firefox.
>>>> curl and wget both have the --user-agent option for this.
>>>>
>>>> Do you get a different file when using that option?
>>>
>>> There was one additional detail to make this work. Instead of placing a
>>> request for index.html, I had to ask curl to get just the directory name
>>> ending with a slash. Then the server responded with (essentially)
>>> index.html.
>>
>> A web server might give different content on 'dir', 'dir/' and
>> 'dir/index.html'. This is sometimes puzzling and as you can see, 'dir/'
>> can't be used as a filename - so we use 'dir/index.html' for that, which
>> is not correct if the server serves 'dir/index.php' when we request
>> 'dir/'.
>>
>>>
>>> Both curl and wget retrieve index.html contents without '1f43' when
>>> asking for just that URL. vimdiff says the retrieved files are
>>> identical.
>>
>> Try to start with this URL using your original wget command line. You
>> could add a quota (-Q) to limit the amount of data, in the hope of
>> reproducing your issue with far fewer files and less data to download.
>>
>>> I am at a loss as to how to explain how the '1f43' problem appears when
>>> asking wget to update the mirror of the site (rather than downloading a
>>> single file). I'll look at the log file tomorrow and see if I get more
>>> ideas.
>>
>> Try to reduce the needed amount of data to reproduce it.
>>
>> Regards, Tim
>>
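
For the first step suggested at the top of this reply (finding out in which
downloaded files 1f43 appears), a small script may be easier to work with than
reading grep output by hand. The following is only a sketch under assumptions:
the function name find_marker, the '*.html' pattern and the mirror directory
are made up for illustration and are not part of wget. It reports each
occurrence with its byte offset and whether it directly follows a ^M
(optionally followed by a linefeed), as described in the quoted message.

```python
from pathlib import Path

MARKER = b"1f43"  # the stray substring reported in this thread


def find_marker(root, marker=MARKER, pattern="*.html"):
    """Return (path, offset, after_cr) for each occurrence of marker.

    root and pattern are assumptions about where the mirror lives and
    which files are worth scanning; adjust them to the actual setup.
    """
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        pos = data.find(marker)
        while pos != -1:
            # Inspect the two bytes before the marker: "\r" or "\r\n".
            prev = data[max(0, pos - 2):pos]
            after_cr = prev.endswith(b"\r") or prev.endswith(b"\r\n")
            hits.append((str(path), pos, after_cr))
            pos = data.find(marker, pos + 1)
    return hits


if __name__ == "__main__":
    # "mirror" is a placeholder directory name.
    for name, offset, after_cr in find_marker("mirror"):
        print(name, offset, "after ^M" if after_cr else "no ^M before")
```

The file names it prints can then be looked up in the log written with -o to
find the matching request, which is the one to repeat with a single
non-recursive wget.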