On 2/18/19 12:02 AM, Andres Valloud wrote:
> Hi, so I ran wget like this:
>
> wget --no-check-certificate -dcrNl inf $baseUrl/root/pub/mods/2012/ -P
> $baseLocal -o wget-mods-2012.log
>
> Looking at the log, '1f43' appears (I think) as a consequence of -l inf,
> because .../mods/2012/ has a reference to .../mods/, which leads wget to
> read the entire .../mods/ index.

Use -np / --no-parent if you don't want to ascend to the parent directory.

> According to my understanding of the log file, wget then collects all
> the possible URLs from .../mods/. It is here that, after what seems
> like thousands of files, a single merge log entry shows '1f43' (some path
> parts elided).

'1f43' is part of a 'chunked' download. I made some tests printing out
the raw received payload of /root/pub/mods/index.html.

Seeing this in a downloaded file is clearly a bug. But I can't reproduce
it with your command sequence. The different index files in
/root/pub/mods/ all have a size of 997015 here. Even after several retries.

Maybe you can send me via PM a full working command sequence using a
fresh / clean directory. To further reduce the downloads, try with
-R '*.mp3,*.xm,*.ogg,*.mod,*.it,*.spc'.

Regards, Tim

> .../root/pub/mods/index.html?C=N;O=D:
> merge(‘.../root/pub/mods/?C=N;O=D’, ‘lizardking_-_quest.mp31f43’) ->
> .../root/pub/mods/lizardking_-_quest.mp31f43
> appending ‘.../root/pub/mods/lizardking_-_quest.mp31f43’ to urlpos.
>
> Then I issued the command (some path parts elided)
>
> wget --no-check-certificate .../root/pub/mods/
>
> which resulted in a 974kb index.html file that has no occurrences of
> '1f43' (more on this request down below).
>
> I wondered whether this could be happening because there are .html files
> that *do* have '1f43' already downloaded in the local downloading
> directory. That is, will wget look at existing files, or will it
> download them from scratch? But the log file seems to indicate the
> index.html was downloaded from scratch, not examined from disk.
>
> The "bad" request looks like this (some path parts elided):
>
> ---request begin---
> GET /root/pub/mods/?C=N;O=D HTTP/1.1^M
> Referer: .../root/pub/mods/^M
> If-Modified-Since: Sun, 10 Feb 2019 02:33:09 GMT^M
> Range: bytes=998575-^M
> User-Agent: Wget/1.20.1 (linux-gnu)^M
> Accept: */*^M
> Accept-Encoding: identity^M
> Host: saphirjd.me^M
> Connection: Keep-Alive^M
> ^M
> ---request end---
> HTTP request sent, awaiting response...
> ---response begin---
> HTTP/1.1 200 OK^M
> Date: Sat, 16 Feb 2019 21:51:21 GMT^M
> Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
> Keep-Alive: timeout=2, max=18^M
> Connection: Keep-Alive^M
> Transfer-Encoding: chunked^M
> Content-Type: text/html;charset=UTF-8^M
> ^M
> ---response end---
> 200 OK
> Length: unspecified [text/html]
> Saving to: ‘.../root/pub/mods/index.html?C=N;O=D’
>
>      0K .......... .......... .......... .......... ..........  234K
>     50K .......... .......... .......... .......... .......... 11.6M
>    100K .......... .......... .......... .......... .......... 14.4M
>    150K .......... .......... .......... .......... ..........  238K
>    200K .......... .......... .......... .......... ..........  657K
>    250K .......... .......... .......... .......... .......... 11.3M
>    300K .......... .......... .......... .......... .......... 8.44M
>    350K .......... .......... .......... .......... ..........  397K
>    400K .......... .......... .......... .......... ..........  627K
>    450K .......... .......... .......... .......... .......... 2.38M
>    500K .......... .......... .......... .......... .......... 4.47M
>    550K .......... .......... .......... .......... .......... 3.46M
>    600K .......... .......... .......... .......... ..........  477K
>    650K .......... .......... .......... .......... .......... 4.14M
>    700K .......... .......... .......... .......... ..........  717K
>    750K .......... .......... .......... .......... .......... 3.50M
>    800K .......... .......... .......... .......... .......... 3.01M
>    850K .......... .......... .......... .......... .......... 4.40M
>    900K .......... .......... .......... .......... .......... 2.69M
>    950K .......... .......... ...                              68.9K=1.4s
>
> Last-modified header missing -- time-stamps turned off.
> 2019-02-16 13:51:25 (717 KB/s) - ‘.../root/pub/mods/index.html?C=N;O=D’
> saved [998575]
>
> Loaded .../root/pub/mods/index.html?C=N;O=D (size 998575).
>
>
> The "good" request looks like this:
>
> ---request begin---
> GET /root/pub/mods/ HTTP/1.1^M
> User-Agent: Wget/1.20.1 (linux-gnu)^M
> Accept: */*^M
> Accept-Encoding: identity^M
> Host: saphirjd.me^M
> Connection: Keep-Alive^M
> ^M
> ---request end---
> HTTP request sent, awaiting response...
> ---response begin---
> HTTP/1.1 200 OK^M
> Date: Sun, 17 Feb 2019 22:42:04 GMT^M
> Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
> Keep-Alive: timeout=2, max=25^M
> Connection: Keep-Alive^M
> Transfer-Encoding: chunked^M
> Content-Type: text/html;charset=UTF-8^M
> ^M
> ---response end---
> 200 OK
> Registered socket 5 for persistent reuse.
> Length: unspecified [text/html]
> Saving to: ‘index.html.1’
>
>      0K .......... .......... .......... .......... .......... 71.1K
>     50K .......... .......... .......... .......... ..........  221K
>    100K .......... .......... .......... .......... ..........  241K
>    150K .......... .......... .......... .......... ..........  232K
>    200K .......... .......... .......... .......... .......... 4.81M
>    250K .......... .......... .......... .......... .......... 1.64M
>    300K .......... .......... .......... .......... ..........  249K
>    350K .......... .......... .......... .......... .......... 2.49M
>    400K .......... .......... .......... .......... .......... 3.71M
>    450K .......... .......... .......... .......... ..........  258K
>    500K .......... .......... .......... .......... .......... 1.41M
>    550K .......... .......... .......... .......... .......... 1.46M
>    600K .......... .......... .......... .......... .......... 2.32M
>    650K .......... .......... .......... .......... ..........  340K
>    700K .......... .......... .......... .......... .......... 2.19M
>    750K .......... .......... .......... .......... .......... 4.10M
>    800K .......... .......... .......... .......... .......... 2.68M
>    850K .......... .......... .......... .......... .......... 3.17M
>    900K .......... .......... .......... .......... .......... 3.22M
>    950K .......... .......... ...                              2.07M=2.1s
>
> 2019-02-17 14:42:09 (453 KB/s) - ‘index.html.1’ saved [997015]
>
>
> So I examined the "bad" html file. Unlike the "good" file, the "bad"
> file starts like this (contents enclosed by ====== bars):
>
> ======================================================================
> 13a
> <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
> <html><head>
> <title>416 Requested Range Not Satisfiable</title>
> </head><body>
> <h1>Requested Range Not Satisfiable</h1>
> <p>None of the range-specifier values in the Range
> request-header field overlap the current extent
> of the selected resource.</p>
> </body></html>
>
> 0
>
> HTTP/1.1 200 OK
> Date: Sun, 10 Feb 2019 02:33:04 GMT
> Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h
> Keep-Alive: timeout=2, max=24
> Connection: Keep-Alive
> Transfer-Encoding: chunked
> Content-Type: text/html;charset=UTF-8
>
> ee3
> ======================================================================
>
>
> The "13a" and "ee3" characters are present in the file. This data also
> seems to explain why the file saved to disk is about 1kb larger than the
> file downloaded individually. It looks like the index.html file saved
> to disk contains (i.e. begins with) garbage from a different request
> that ended in 416. After that prolog of apparent junk, the file proper
> seems to begin as expected --- but it also has several occurrences of
> '1f43'.
>
> A vimdiff run on bad.html and good.html shows some order differences,
> seemingly a table replaced with '1f43', and things of that nature.
> The structure of the differences is not immediately obvious, as there are
> very large sections that differ seemingly because the file was served in
> different order.
>
> Andres.
>
>
> On 2/17/19 12:15, Tim Rühsen wrote:
>> On 16.02.19 23:02, Andres Valloud wrote:
>>> Tim,
>>>
>>> I limited the data from 99gb to 3.3gb, and just to the directory where
>>> I've seen the problem occur. The strange string '1f43' appears in this
>>> limited setup. The '1f43' substring seems to appear deterministically
>>> depending on the file name (I have not checked *every* occurrence by
>>> hand).
>>>
>>> How should I track this down?
>>
>> I'd use -d -olog and leave out -k. If 1f43 still appears, we know it's
>> not because of wget's parsing or conversion. In this case it's from the
>> server... check in which file 1f43 appears and find the request in the
>> log file.
>>
>> Then try to download that file with a single (non-recursive) wget
>> command. Check if 1f43 appears in there. If it doesn't, compare both
>> requests to see the difference.
>>
>> Let us know the results.
>>
>> Regards, Tim
>>
>>>
>>> Andres.
>>>
>>> On 2/14/19 04:03, Tim Rühsen wrote:
>>>> On 2/14/19 12:25 PM, Andres Valloud wrote:
>>>>> Tim,
>>>>>
>>>>> On 2/14/19 02:03, Tim Rühsen wrote:
>>>>>>> I looked at the downloaded html files with grep. They do contain
>>>>>>> the substring "1f43", seemingly after a ^M character (I did not
>>>>>>> check every single occurrence). Sometimes, the ^M character is
>>>>>>> within a file name such as this:
>>>>>>>
>>>>>>> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
>>>>>>> 1f43^M
>>>>>>> "
>>>>>>
>>>>>> If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
>>>>>> correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
>>>>>> End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers
>>>>>> simply ignore it. This is nothing that can be addressed with
>>>>>> --restrict-file-names.
>>>>>>
>>>>>> But to make sure, look at the original file by downloading it with
>>>>>> 'wget <URL>'. Does the file have the above '1f43'/^M stuff in it as
>>>>>> well? If so, we can't do much about it.
>>>>>>
>>>>>> If all looks ok in there, please attach both files so we can compare
>>>>>> and possibly reproduce.
>>>>>>
>>>>>> If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
>>>>>> x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
>>>>>> request is coming via Firefox.
>>>>>> curl and wget both have the --user-agent option for this.
>>>>>>
>>>>>> Do you get a different file when using that option?
>>>>>
>>>>> There was one additional detail to make this work. Instead of
>>>>> placing a request for index.html, I had to ask curl to get just the
>>>>> directory name ending with a slash. Then the server responded with
>>>>> (essentially) index.html.
>>>>
>>>> A web server might give different content on 'dir', 'dir/' and
>>>> 'dir/index.html'. This is sometimes puzzling and as you can see, 'dir/'
>>>> can't be used as filename - so we use 'dir/index.html' for that. Which
>>>> is not correct if the server serves 'dir/index.php' when we request
>>>> 'dir/'.
>>>>
>>>>>
>>>>> Both curl and wget retrieve index.html contents without '1f43' when
>>>>> asking for just that URL. vimdiff says the retrieved files are
>>>>> identical.
>>>>
>>>> Try to start with this URL using your original wget command line. You
>>>> could add a quota (-Q) to limit the amount of data, in the hope of
>>>> reproducing your issue with far fewer files and much less data.
>>>>
>>>>> I am at a loss as to how to explain how the '1f43' problem appears
>>>>> when asking wget to update the mirror of the site (rather than
>>>>> downloading a single file). I'll look at the log file tomorrow and
>>>>> see if I get more ideas.
>>>>
>>>> Try to reduce the needed amount of data to reproduce it.
>>>>
>>>> Regards, Tim
>>>>
>>
>
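
A note on the '1f43' marker discussed in this thread: in HTTP/1.1 chunked transfer coding, every chunk is preceded by its size in hexadecimal followed by CRLF, so a string such as '1f43' (8003 in decimal) showing up after a ^M would be consistent with a chunk-size line that was written to disk instead of being stripped. The following is only a minimal Python sketch of how a client is expected to remove those markers. The function name `decode_chunked` and the sample payload are made up for illustration, trailers are ignored, and this is not wget's actual code.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked message body (trailers are ignored)."""
    body = bytearray()
    pos = 0
    while True:
        # Each chunk starts with "<hex size>[;extensions]\r\n".
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol].split(b";", 1)[0].strip(), 16)
        pos = eol + 2
        if size == 0:
            # A zero-sized chunk terminates the body.
            break
        body += raw[pos:pos + size]
        # Skip the chunk data and the CRLF that follows it.
        pos += size + 2
    return bytes(body)


# Hypothetical payload: two chunks of 5 and 6 bytes, then the terminator.
raw = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
decoded = decode_chunked(raw)
print(decoded)
print(int("1f43", 16))
```

If the size lines ('5', '6', '0' here, or '13a', 'ee3', '1f43' in the files quoted above) survive into the saved file, the body was stored without this decoding step for at least part of the stream.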