Hi Paul

Thank you very much indeed for your very informative and helpful reply and for 
the link to your MakeStaticSite tool. I will try it out.

Kind regards

Tim
  ----- Original Message ----- 
  From: Paul Trafford 
  To: timsc...@timscrim.co.uk ; bug-wget@gnu.org 
  Sent: Thursday, March 13, 2025 10:22 AM
  Subject: Re: Problem downloading a website from archive.org


  Hello Tim,


  Websites on archive.org or, more specifically, web.archive.org, are, as 
you've observed, stored piecemeal as snapshots. When browsing, the Wayback 
Machine stitches the snapshots together.


  The problem of retrieval for the likes of Wget is explained by Archive Team:
  https://wiki.archiveteam.org/index.php?title=Restoring


  As an attempted solution, I have developed a prototype tool, MakeStaticSite, 
that runs Wget iteratively, downloading snapshots selectively to minimise 
repetition, then merging them into a canonical form.
  https://makestaticsite.sh/
  https://github.com/paultraf/makestaticsite
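
  To illustrate the general idea of selecting snapshots so that each page is 
fetched only once, here is a rough Python sketch. It is not how MakeStaticSite 
is implemented; it only assumes that snapshot URLs have the form 
https://web.archive.org/web/<timestamp>/<original URL>, optionally with a 
two-letter modifier such as "id_" after the timestamp. The function names are 
hypothetical.

```python
import re

# Assumed snapshot URL layout:
#   https://web.archive.org/web/<timestamp>[<modifier>]/<original URL>
# where <timestamp> is up to 14 digits and <modifier> is e.g. "id_" or "im_".
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{1,14})(?:[a-z]{2}_)?/(.+)$"
)


def split_snapshot_url(url):
    """Return (timestamp, original_url), or None if url is not a snapshot URL."""
    m = WAYBACK_RE.match(url)
    if m is None:
        return None
    return m.group(1), m.group(2)


def dedupe_snapshots(urls):
    """Keep one snapshot URL per original URL (the first one seen)."""
    chosen = {}
    for url in urls:
        parts = split_snapshot_url(url)
        if parts is None:
            continue  # not a Wayback snapshot URL; skip it
        _, original = parts
        chosen.setdefault(original, url)
    return list(chosen.values())
```

  Keying on the original URL rather than the full snapshot URL is what stops 
the same page being queued again under a different datestamp. Variants such as 
http versus https, or with and without "www", would still count as different 
pages in this sketch.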


  Otherwise, there are various approaches, APIs and tools.  See e.g.,
  https://archive.org/help/wayback_api.php
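
  As a small example of the API route, the availability endpoint described on 
that page takes a "url" and an optional "timestamp" parameter and returns JSON 
naming the closest snapshot. The sketch below uses only the Python standard 
library; the helper names are my own, and error handling is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"


def availability_url(url, timestamp=None):
    """Build the query URL; timestamp is a string such as '20060101'."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(response):
    """Extract the closest snapshot URL from a decoded JSON response."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


def lookup(url, timestamp=None):
    """Query the API over the network and return the closest snapshot URL."""
    with urlopen(availability_url(url, timestamp)) as resp:
        return closest_snapshot(json.load(resp))
```

  Calling lookup("example.com", "20060101") would then return the snapshot URL 
nearest to that date, or None if nothing is archived.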


  Regards, 

  Paul

  Paul Trafford
  Oxford, UK



  On 12/03/2025 22:57, timsc...@timscrim.co.uk wrote:

Hi Everyone

I am trying to download a complete website from archive.org using Wget but I 
have run into a problem.

If you are a human and you are exploring an old website on archive.org, you may 
notice that sometimes when you click on a link from one page on the website to 
another, the datestamp part of the URL changes. You can also end up on the same 
page as you were previously but with a different datestamp.

This is not much of a problem if you are a human but it is a problem for 
webcrawlers such as wget because they can end up duplicating some parts of a 
website many times and not reaching other parts of a website for a very long 
time. This is the problem I am having.

Do any of you know the solution to this problem?

Thank you very much.

Kind regards

Tim

P.S.  I am sorry if this is a duplicate but I previously posted it before 
subscribing so I don't know if it went through.

