Hello Tim,
Websites on archive.org, or more specifically web.archive.org, are, as
you've observed, stored piecemeal as snapshots. When browsing, the
Wayback Machine stitches the snapshots together.
The problem of retrieval for the likes of Wget is explained by Archive Team
https://wiki.archiveteam.org/index.php?title=Restoring
As an attempted solution, I have developed a prototype tool,
MakeStaticSite, which runs Wget iteratively, downloading snapshots
selectively to minimise repetition and then merging them into a canonical form.
https://makestaticsite.sh/
https://github.com/paultraf/makestaticsite
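To illustrate the deduplication idea (this is just a sketch of the principle, not MakeStaticSite's actual code): every Wayback capture URL embeds a 14-digit datestamp, so two links to the same page can look different to a crawler. Stripping the datestamp yields a canonical key for detecting duplicates.

```python
import re

# Wayback capture URLs look like:
#   https://web.archive.org/web/20190101000000/http://example.com/about
# optionally with a two-letter modifier (e.g. "id_") after the datestamp.
WAYBACK_RE = re.compile(
    r"https?://web\.archive\.org/web/(\d{1,14})(?:[a-z]{2}_)?/(.+)"
)

def canonical_key(url):
    """Strip the Wayback datestamp so two snapshots of the same page
    compare equal. Returns the original (archived) URL, or the input
    unchanged if it is not a Wayback capture URL."""
    m = WAYBACK_RE.match(url)
    return m.group(2) if m else url
```

A crawler that keys its "visited" set on canonical_key() rather than the raw URL avoids re-fetching the same page under many datestamps.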
Otherwise, there are various other approaches, APIs and tools. See, e.g.,
https://archive.org/help/wayback_api.php
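For instance, the Availability API documented at that page takes a URL and an optional datestamp and returns JSON describing the closest snapshot, which lets a script pick one capture per page up front instead of following rewritten links. A minimal sketch of building such a query (the endpoint is from the API docs; fetching and parsing the JSON is left out here):

```python
from urllib.parse import urlencode

def availability_query(url, timestamp=None):
    """Build a Wayback Machine Availability API request URL.
    Fetching it returns JSON whose archived_snapshots.closest entry
    names the capture nearest to `timestamp` (YYYYMMDDhhmmss,
    which may be truncated, e.g. just a year)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)
```

Retrieving the snapshot URL from the response, rather than crawling link-to-link, sidesteps the shifting-datestamp problem entirely for pages you can enumerate.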
Regards,
Paul
Paul Trafford
Oxford, UK
On 12/03/2025 22:57, timsc...@timscrim.co.uk wrote:
Hi Everyone
I am trying to download a complete website from archive.org using Wget but I
have run into a problem.
If you are a human and you are exploring an old website on archive.org, you may
notice that sometimes when you click on a link from one page on the website to
another, the datestamp part of the URL changes. You can also end up on the same
page as you were previously but with a different datestamp.
This is not much of a problem if you are a human, but it is a problem for
webcrawlers such as wget because they can end up duplicating some parts of a
website many times and not reaching other parts of a website for a very long
time. This is the problem I am having.
Do any of you know the solution to this problem?
Thank you very much.
Kind regards
Tim
P.S. I am sorry if this is a duplicate, but I previously posted it before
subscribing so I don't know if it went through.