Hello Tim,
Websites on archive.org, or more specifically web.archive.org, are, as
you've observed, stored piecemeal as snapshots. When browsing, the
Wayback Machine stitches the snapshots together.
The problem of retrieval for the likes of Wget is explained by Archive Team
https://wiki.archiveteam.org/index.php?title=Restoring
As an attempted solution, I have developed a prototype tool,
MakeStaticSite, which runs Wget iteratively, downloading snapshots
selectively to minimise repetition and then merging them into a canonical form.
https://makestaticsite.sh/
https://github.com/paultraf/makestaticsite
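To illustrate the deduplication idea (this is just a sketch of the principle, not MakeStaticSite's actual code): every Wayback capture URL embeds a 14-digit datestamp, so two links to the same page can look different to a crawler. Stripping the datestamp yields a canonical key for detecting duplicates.

```python
import re

# Wayback capture URLs look like:
#   https://web.archive.org/web/20190101000000/http://example.com/about
# optionally with a two-letter modifier (e.g. "id_") after the datestamp.
WAYBACK_RE = re.compile(
    r"https?://web\.archive\.org/web/(\d{1,14})(?:[a-z]{2}_)?/(.+)"
)

def canonical_key(url):
    """Strip the Wayback datestamp so two snapshots of the same page
    compare equal. Returns the original (archived) URL, or the input
    unchanged if it is not a Wayback capture URL."""
    m = WAYBACK_RE.match(url)
    return m.group(2) if m else url
```

A crawler that keys its "visited" set on canonical_key() rather than the raw URL avoids re-fetching the same page under many datestamps.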
Otherwise, there are various other approaches, APIs and tools. See, e.g.,
https://archive.org/help/wayback_api.php
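For instance, the Availability API documented at that page takes a URL and an optional datestamp and returns JSON describing the closest snapshot, which lets a script pick one capture per page up front instead of following rewritten links. A minimal sketch of building such a query (the endpoint is from the API docs; fetching and parsing the JSON is left out here):

```python
from urllib.parse import urlencode

def availability_query(url, timestamp=None):
    """Build a Wayback Machine Availability API request URL.
    Fetching it returns JSON whose archived_snapshots.closest entry
    names the capture nearest to `timestamp` (YYYYMMDDhhmmss,
    which may be truncated, e.g. just a year)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)
```

Retrieving the snapshot URL from the response, rather than crawling link-to-link, sidesteps the shifting-datestamp problem entirely for pages you can enumerate.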
Regards,
Paul
Paul Trafford
Oxford, UK
On 12/03/2025 22:57, timsc...@timscrim.co.uk wrote:
Hi Everyone
I am trying to download a complete website from archive.org using Wget but I
have run into a problem.
If you are a human and you are exploring an old website on archive.org, you may
notice that sometimes when you click on a link from one page on the website to
another, the datestamp part of the URL changes. You can also end up on the same
page as you were previously but with a different datestamp.
This is not much of a problem if you are a human, but it is a problem for
webcrawlers such as wget because they can end up duplicating some parts of a
website many times and not reaching other parts of a website for a very long
time. This is the problem I am having.
Do any of you know the solution to this problem?
Thank you very much.
Kind regards
Tim
P.S. I am sorry if this is a duplicate, but I previously posted it before
subscribing so I don't know if it went through.