[CODE4LIB] wget archiving for dummies
Hey C4L,

If I wanted to archive a Wordpress site, how would I do so? To elaborate: our library recently received a donation of a remote Wordpress site, sitting one directory below the root of a domain. I can tell from a cursory look that it's a Wordpress site. We've never archived a website before and I don't need to do anything fancy, just download a workable copy as it presently exists. I've heard this can be as simple as:

wget -m $PATH_TO_SITE_ROOT

but that's not working as planned. Wget's convert-links feature doesn't seem to be quite so simple; if I download the site, disable my network connection, and then host it locally, some 20 resources aren't available. Mostly images under the same directory, possibly loaded via AJAX. Advice?

(Anticipated) pertinent advice: I shouldn't be doing this at all; we should outsource to Archive-It or similar, who actually know what they're doing. Yes/no?

Best, Eric
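(For reference, a minimal sketch of a fuller invocation that also pulls in images, CSS, and other page requisites and rewrites links for offline browsing; the URL is a placeholder, not the actual donated site:

wget --mirror --page-requisites --convert-links --adjust-extension --no-parent http://example.com/blog/

--adjust-extension is spelled --html-extension in older wget releases. Anything a page fetches via JavaScript after it loads will still be missed; the replies below go into that.)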
Re: [CODE4LIB] wget archiving for dummies
I wanted a quick-and-dirty solution to archiving our old LibGuides site a few months ago. wget was my first port of call also. I don't have good notes as to what went wrong, but I ended up using httrack: http://www.httrack.com/ It basically worked out of the box.

HTH,
Alex
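In case it helps anyone trying the same route, a minimal httrack sketch; the URL, output directory, and filter below are placeholders for a site sitting under /blog/:

httrack "http://example.com/blog/" -O ./blog-mirror "+*.example.com/blog/*" -v

-O sets the output directory, the "+..." pattern is a filter that keeps the crawl inside the blog directory, and -v is verbose; httrack rewrites links for local browsing by default.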
Re: [CODE4LIB] wget archiving for dummies
Hi Eric,

I have created static versions of several WordPress sites. Here's a link to one of the sites: http://futureofthebook.org/occurrence/ As you will see, some of the functionality is lost, such as the search and commenting features. But the content is preserved, and now I don't have to maintain WordPress for this site (its need for interactivity is long past).

Here is the wget command I used:

wget \
  --recursive \
  --no-clobber \
  --page-requisites \
  --html-extension \
  --convert-links \
  --restrict-file-names=windows \
  --include /occurrence \
  --no-parent \
  http://www.futureofthebook.org/occurrence/ \
  --domains www.futureofthebook.org

I'm not certain that I needed all of these switches, but some of them were necessary. After I did the wget, I put the set of files into a new location and then tested, tested, tested. Some links didn't work properly, so I had to do some manual work to get a fully functioning site. Nothing is perfect. Once I had everything working the way I wanted, I pointed my Web server to the new location of the site, backed up my WordPress database and files, and saved everything as a tar file, just in case.

Good luck!

Best wishes,
Carol

--
Carol Kassel
NYU Digital Library Technology Services
c...@nyu.edu
(212) 992-9246
dlib.nyu.edu
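For the testing step, one way to check the copy is to serve it over a throwaway local web server rather than opening files straight from disk, since that is closer to how the archived site will eventually be hosted. A sketch, assuming Python 3 is available and using an illustrative path:

cd ./www.futureofthebook.org
python3 -m http.server 8000
# then browse http://localhost:8000/occurrence/ with the network connection disabled
# and confirm that pages, images, and stylesheets all load from the local copy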
Re: [CODE4LIB] wget archiving for dummies
I've used wget extensively for web preservation. It's a remarkably powerful tool, but there are some notable features/caveats to be aware of:

1) You absolutely should use the --warc-file=NAME and --warc-header=STRING options. These will create a WARC file alongside the usual wget filedump, which captures essential information (process provenance, server requests/responses, raw data before wget adjusts it) for preservation. The warc-header option includes user-added metadata, such as the name, purpose, etc. of the capture. It's likely that you won't use the WARC for access, but keeping it as a preservation copy of the site is invaluable.

2) Javascript, AJAX queries, links in rich media, and such are completely opaque to wget. As such, you'll need to QC aggressively to ensure that you captured everything you intended to. My method was to run a generic wget capture [1], QC it, and manually download missing objects. I'd then pass everything back into wget to create a complete WARC file containing the full capture. It's janky, but it gets the job done.

3) Do be careful of comment features, which often turn into spider traps. The latest versions of wget have regex support, so you can blacklist URLs that you know will trap the crawler (see the sketch after this message).

If the site is proving stubborn, I can take a look off-list.

Best of luck,
Alex

[1] I've used the following successfully:

wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" --warc-file=FILENAME --warc-header=STRING --page-requisites -e robots=off --random-wait --wait=5 --recursive --level=0 --no-parent --convert-links URL
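A minimal sketch of what that URL blacklisting might look like, combined with the WARC options from point 1. The rejected patterns target common WordPress crawler traps (comment reply links, share links, paginated comment pages); they are illustrative only, and the URL and operator name are placeholders. Both --reject-regex and the WARC options require a reasonably recent wget (1.14 or later, if memory serves):

wget --recursive --level=0 --no-parent --page-requisites --convert-links \
     --reject-regex "replytocom=|\?share=|/comment-page-" \
     --warc-file=site-capture \
     --warc-header="operator: Your Name Here" \
     --wait=5 --random-wait -e robots=off \
     http://example.com/blog/

Adjust the regex to whatever trap URLs actually show up in your test crawls.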
Re: [CODE4LIB] wget archiving for dummies
I love that user agent. This is the wget command I've used to back up sites that have pretty URLs:

wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ URL

– Jamie
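A rough gloss on those switches, for anyone new to wget (a sketch; the wget manual is the authority on exact behavior):

wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ URL
#  -v                verbose output
#  --mirror          recursive crawl with timestamping and unlimited depth
#  -p                also fetch page requisites (images, CSS, JS)
#  --html-extension  save pages with an .html suffix (--adjust-extension in newer wget)
#  -e robots=off     ignore robots.txt
#  --base=./         reference URL for resolving relative links (mainly relevant with -i/-F)
#  -k                convert links so the copy browses locally
#  -P ./             save everything under the current directory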
Re: [CODE4LIB] wget archiving for dummies
I was dealing with a lot of sites that would shunt the user around based on their user agent (e.g. very old sites that had completely different pages for Netscape and IE), so I needed something neutral that wouldn't get caught in a browser-specific branch. Suffice it to say, nothing ever checks for Amiga browsers :)
Re: [CODE4LIB] wget archiving for dummies
Thanks for the advice all. I'm trying httrack now but the other wget options are good to know about, especially Alex's point about saving a WARC file.

One clarification: I definitely don't want to deal with the database, nor can I. We don't have admin or server access. Even if we did, I don't think preserving the db would be wise or necessary.

Best, Eric