I don't know if this will help, but it might be an option.

If the site is still operational through a browser, HTTrack 
(https://www.httrack.com/) will mirror everything it can reach by following 
links and make it available for offline viewing.

________________________________________
From: Code for Libraries [[email protected]] on behalf of Demian Katz 
[[email protected]]
Sent: Wednesday, March 04, 2020 10:37 AM
To: [email protected]
Subject: [CODE4LIB] WARC --> static HTML?

Hello, everyone –

I’ve been struggling with a use case that feels like it can’t be unique to my 
situation. Wondering if anyone else has solved this!

We’ve decommissioned an old dynamic site, and we still want to make the content 
available in a static form. It was a large and complex site with a lot of 
pages, and after trying a variety of solutions, we ended up harvesting it all 
into a WARC file. This is great for archival purposes, but we’re struggling 
with presentation.

The problem with serving content from a WARC is that it seems to be unbearably 
slow in every solution we try. (And when I say unbearably, I mean “40 minutes 
to load one page using pywb” – not kidding).

I assume that this slowness has to do with dynamically navigating around in a 
multi-gigabyte file to retrieve things… but really all we want to do is serve 
up static content.

Is there some tool that can simply unpack a WARC into a directory of static 
files that can be navigated quickly? It seems like this should be possible, but 
I’m coming up empty in searching.
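
For illustration only, here is a minimal Python sketch of one piece of such an unpacking step: mapping each captured URL to a file path on disk and writing the response body there. It is not an existing tool. The names `url_to_path` and `save_payload`, the `site` output directory, and the hashed-query naming scheme are all hypothetical. Reading the records out of the WARC is not shown; that would need a WARC parser (the warcio library is one candidate) to supply each record's URL and payload bytes.

```python
import hashlib
from pathlib import Path
from urllib.parse import unquote, urlsplit


def url_to_path(url: str, root: str = "site") -> Path:
    """Map a captured URL to a file path under root (hypothetical layout)."""
    parts = urlsplit(url)
    # Decode percent-escapes first, then drop empty, "." and ".." segments
    # so the resulting path can never climb out of the output directory.
    segments = [s for s in unquote(parts.path).split("/")
                if s not in ("", ".", "..")]
    # Directory-style URLs (empty path or trailing slash) become index.html.
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    if parts.query:
        # Dynamic pages often differ only by query string, so append a
        # short hash of the query to keep each variant in its own file.
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:8]
        stem, _, ext = segments[-1].rpartition(".")
        if stem:
            segments[-1] = f"{stem}-{digest}.{ext}"
        else:
            segments[-1] = f"{segments[-1]}-{digest}"
    # Replace ":" in host:port so the directory name is valid on Windows too.
    return Path(root, parts.netloc.replace(":", "_"), *segments)


def save_payload(url: str, payload: bytes, root: str = "site") -> Path:
    """Write one response body to the path derived from its URL."""
    target = url_to_path(url, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target
```

This sketch leaves real gaps: links inside the saved HTML are not rewritten to match the new file names, and a URL saved as a plain file (say `/a`) will collide with a later URL that needs the same name as a directory (`/a/b`).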

And just to be clear: I understand that unpacking a WARC probably won’t retain 
all of the richness of detail that dynamic retrieval from the WARC can provide, 
and I certainly don’t plan to throw away the WARC… but for people who just want 
to quickly navigate content from the most recently crawled version of the old 
site, I want a solution that will perform acceptably, and I haven’t found it 
yet.

Thanks for any and all advice! 😊

- Demian
