Hello,

Wget can be used to download from archive.org, but larger sites won't
finish, because archive.org stops allowing access after a certain amount
of time (a few days).

Also, archive.org recently changed their policy so that if a URL times
out, that URL may not be accessed again for the rest of the day. This
breaks wget terribly. I hope to write to them about this.

Wget2, as I think I mentioned on the list already, changes the behavior
from wget and grabs waaay too much. I want to fix this bug, but have
yet to code it (sorry!)

But assuming you can put up with these difficulties, here's what you need:

wget -NEkrlXXX -t XXX --timeout XXX \
  --reject-regex 'http.*http.*http|\.html?.*\.html?.*\.html?|www\..*www\..*www\.' \
  --accept-regex '(.*\.(css|gif|png|jpe?g|webp|svg)$|https?://web\.archive\.org/web/[^*]+/https?://?(i0.wp.com|i[0-9].wp.com|s[0-9].wp.com|([0-9]\.)?bp.blogspot.com|www.blogger.com|www.blogblog.com|lh[0-9]\.googleusercontent.com|fonts.googleapis.com|(ssl|www|fonts).gstatic.com|(www[0-9]*?\.)?URL))' \
  'URL'

(The backslashes continue the command across lines; each regex itself
must stay on a single line.)

-N for timestamps, which are needed most of the time. -E to fix the file
extension (adding .html), which is necessary far too often. -k to convert
the URLs for local browsing. -r for recursive and -l for how far down to go.

The reject regex is minimal; it just prevents recursive downloading of
other sites -- you'd be surprised how often it has to be used.
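As a quick sanity check, you can run the reject regex through grep: it drops URLs that have picked up extra sites during recursion (three "http"s, three ".html"s, or three "www."s in a single URL). The domains below are hypothetical stand-ins.

```shell
# The reject regex from the wget command above.
reject='http.*http.*http|\.html?.*\.html?.*\.html?|www\..*www\..*www\.'

# A Wayback URL that has swallowed two further http URLs is rejected:
echo 'http://web.archive.org/web/2024/http://a.example/?u=http://b.example' \
  | grep -qE "$reject" && echo rejected

# A normal archived page (only two "http"s, one ".html") is kept:
echo 'http://web.archive.org/web/2024/http://a.example/page.html' \
  | grep -qE "$reject" || echo kept
```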

The accept regex keeps wget on the straight and narrow path of getting
only the site itself and its page requisites from other hosts, such as
images served from wp.com, blogspot, etc.
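To see what that means in practice, here is the second half of the accept regex tested with grep, with the hypothetical domain example\.com substituted for URL. (The non-greedy "*?" is written as plain "*", since grep's POSIX extended regexes have no non-greedy quantifiers; the set of matched strings is the same either way.)

```shell
# Accept regex with example\.com standing in for URL (hypothetical), and
# "*?" rewritten as "*" for POSIX grep -E compatibility.
accept='(.*\.(css|gif|png|jpe?g|webp|svg)$|https?://web\.archive\.org/web/[^*]+/https?://?(i0.wp.com|i[0-9].wp.com|s[0-9].wp.com|([0-9]\.)?bp.blogspot.com|www.blogger.com|www.blogblog.com|lh[0-9]\.googleusercontent.com|fonts.googleapis.com|(ssl|www|fonts).gstatic.com|(www[0-9]*\.)?example\.com))'

# A stylesheet anywhere is accepted (first alternative):
echo 'https://web.archive.org/web/2024/https://cdn.other.net/style.css' \
  | grep -qE "$accept" && echo kept

# An archived page on the target site is accepted (second alternative):
echo 'https://web.archive.org/web/2024/https://www.example.com/about' \
  | grep -qE "$accept" && echo kept

# An archived page on an unrelated site is not:
echo 'https://web.archive.org/web/2024/https://unrelated.org/about' \
  | grep -qE "$accept" || echo skipped
```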

You'll have to change each XXX to whatever number you think is best, and
each URL to the site you're mirroring, minus the www and http/https
portions.

You're welcome,
David
