> I'm actually thinking about an accessible archive so the old posts
> remain available in some form.
Yes, so I understood.
> But creating a 'just-in-case' backup is
> a good idea.
Not least because it allows the post-disaster creation of an accessible
archive.
> > it would be straightforward to wget everything as a zillion
> > html files ...
> Not quite that easy: Yahoo applies rate limiting on their REST endpoints.
I suggested wget as it has options to cope with that issue. I recommend
--wait=66 combined with --random-wait.
Together these would insert a random wait of 33 to 99 sec between files.
Just over half the gaps are >1 min, which helps stay under the radar, as
does the random interval. Vary the figure as required:
66 will wget about 1300 files per day. For an 18yr archive, assuming a dozen
posts per day, that would take about two months to download (adjust in
proportion to the actual average traffic over the group's history).
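The arithmetic behind those figures can be checked in a shell; the 18-year
lifespan and dozen posts per day are the assumptions from the text:

```shell
# Files fetched per day at an average gap of 66 s between requests.
avg_wait=66
files_per_day=$((86400 / avg_wait))   # seconds in a day / average gap
echo "$files_per_day"                 # 1309, i.e. "about 1300"

# Total posts for an 18-year archive at a dozen posts per day (assumed).
total=$((18 * 365 * 12))
echo "$total"                         # 78840
echo $((total / files_per_day))       # 60 days, i.e. about two months
```

Scale the wait figure and the posts-per-day estimate to match the group's
actual history.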
That's a long time for a Windows machine to stay up, but reasonable for Linux.
A wait of 10 would take under a fortnight, but 8000 files/day might be
noticed. Maybe try that to start with?
(uptime on the ancient laptop that now runs Linux as my router is over a
year, so a two- or three-month fetch is doable; and a Raspberry Pi with an
attached external hard drive would do it easily and use less power)
The advantage of taking a precautionary backup is that there is (probably!)
no need to hurry. Take it slow and you won't take too much bandwidth from
other Yahoo customers, won't make Yahoo's problems worse, and won't fall
foul of their rate limiter.
Read the wget manual to figure out which options you need to define your
recursive download. You can avoid picking up graphic files, for example.
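Something along these lines is a plausible starting point. The URL and
directory name are placeholders, and the option set is a sketch to adapt,
not a tested recipe:

```shell
# Hypothetical invocation -- URL and --directory-prefix are placeholders.
#   --wait=66 --random-wait  -> randomised 33-99 s gap between requests
#   --no-parent              -> don't wander above the starting directory
#   --reject ...             -> skip graphic files
#   --adjust-extension       -> save pages with a .html suffix
wget --recursive --no-parent \
     --wait=66 --random-wait \
     --reject '*.jpg,*.jpeg,*.gif,*.png' \
     --adjust-extension \
     --directory-prefix=yahoo-backup \
     'https://groups.example.com/SOMEGROUP/messages'
```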
Once it's running I suggest taking a
look at its download directory tree after a few hours, and then once a day,
to confirm it's still running and not downloading unwanted stuff.
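A couple of quick spot checks of that sort ("yahoo-backup" is a placeholder
name; a tiny demo tree stands in here for wget's real output):

```shell
# Demo tree standing in for the download directory.
mkdir -p yahoo-backup/group/messages
printf '<html></html>' > yahoo-backup/group/messages/1.html
printf '<html></html>' > yahoo-backup/group/messages/2.html
: > yahoo-backup/group/banner.gif            # an unwanted graphic

find yahoo-backup -type f | wc -l            # how many files so far: 3
du -sh yahoo-backup                          # total disk usage
find yahoo-backup -type f ! -name '*.html'   # anything past the filters?
```

The last command flags non-HTML strays (here the .gif) so you can tighten
the reject list if needed.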
You probably know that wget is a standard command line utility on GNU/Linux.
Various people have compiled it as an .exe for the Windows command line;
some are listed at