[CODE4LIB] wget archiving for dummies

2014-10-06 Thread Eric Phetteplace
Hey C4L,

If I wanted to archive a Wordpress site, how would I do so?

More elaborate: our library recently received a donation of a remote WordPress
site, sitting one directory below the root of a domain. I can tell from a
cursory look that it's WordPress. We've never archived a website before,
and I don't need to do anything fancy, just download a workable copy of the
site as it presently exists. I've heard this can be as simple as:

wget -m $PATH_TO_SITE_ROOT

but that isn't working as planned. Wget's --convert-links feature doesn't
seem to be quite so simple: if I download the site, disable my network
connection, and then host the copy locally, some 20 resources aren't
available, mostly images under the same directory, possibly loaded via AJAX.
Advice?
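
Is the fix just a longer list of flags? Something like the below, where the
URL is a placeholder rather than the real donated site:

wget --mirror --page-requisites --html-extension --convert-links \
  --no-parent http://example.org/site/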

(Anticipated) pertinent advice: I shouldn't be doing this at all, we should
outsource to Archive-It or similar, who actually know what they're doing.
Yes/no?

Best,
Eric


Re: [CODE4LIB] wget archiving for dummies

2014-10-06 Thread Alex Armstrong
I wanted a quick-and-dirty solution for archiving our old LibGuides site
a few months ago.

wget was my first port of call too. I don't have good notes on what
went wrong, but I ended up using httrack:

http://www.httrack.com/

It basically worked out of the box.
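
I don't have the exact command in my notes, but a bare-bones run is roughly
this (the URL and output directory here are just placeholders):

httrack "http://example.org/guides/" -O ./guides-mirror "+*.example.org/*" -v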

HTH,
Alex




Re: [CODE4LIB] wget archiving for dummies

2014-10-06 Thread Carol Kassel
Hi Eric,

I have created static versions of several WordPress sites. Here's a link to
one of the sites:

http://futureofthebook.org/occurrence/

As you will see, some of the functionality is lost, such as the search and
commenting features, but the content is preserved, and now I don't have to
maintain WordPress for a site whose need for interactivity is long past.

Here is the wget command I used:

wget \
 --recursive \
 --no-clobber \
 --page-requisites \
 --html-extension \
 --convert-links \
 --restrict-file-names=windows \
 --include /occurrence \
 --no-parent \
 http://www.futureofthebook.org/occurrence/ \
 --domains www.futureofthebook.org

I'm not certain that I needed all of these switches, but some of them were
necessary.

After I did the wget, I put the set of files into a new location and then
tested, tested, tested. Some links didn't work properly, so I had to do
some manual work to get a fully functioning site. Nothing is perfect.
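
One quick way to spot the stragglers is to grep the copied files for absolute
links that still point at the live host, e.g. (the directory below is just
wherever you put the mirror):

grep -rl "http://www.futureofthebook.org" ./static-copy/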

Once I had everything working the way I wanted, I pointed my Web server to
the new location of the site, backed up my WordPress database and files,
and saved everything as a tar file, just in case.
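
The backup itself can be as simple as something like the following; the
database name, user, and file names here are made up:

mysqldump -u wp_user -p occurrence_db > occurrence-db.sql
tar -czf occurrence-backup.tar.gz wp-files/ occurrence-db.sql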

Good luck!

Best wishes,

Carol

-- 
Carol Kassel
NYU Digital Library Technology Services
c...@nyu.edu
(212) 992-9246
dlib.nyu.edu


Re: [CODE4LIB] wget archiving for dummies

2014-10-06 Thread Alexander Duryee
I've used wget extensively for web preservation.  It's a remarkably
powerful tool, but there are some notable features/caveats to be aware of:

1) You absolutely should use the --warc-file=NAME and
--warc-header=STRING options.  These create a WARC file alongside the usual
wget file dump, which captures information essential for preservation:
process provenance, server requests/responses, and the raw data before wget
adjusts it.  The --warc-header option lets you include user-added metadata,
such as the name, purpose, etc. of the capture (there's a quick sketch of
this below).  It's likely that you won't use the WARC for access, but keeping
it as a preservation copy of the site is invaluable.

2) JavaScript, AJAX queries, links in rich media, and such are completely
opaque to wget.  As such, you'll need to QC aggressively to ensure that you
captured everything you intended to.  My method was to run a generic wget
capture[1], QC it, and manually download missing objects.  I'd then pass
everything back into wget to create a complete WARC file containing the
full capture.  It's janky, but it gets the job done.

3) Do be careful of comment links, which often turn into spider traps.
The latest versions of wget have regex support, so you can blacklist
certain URLs that you know will trap the crawler (see the sketch below).
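
In practice the WARC options look something like this; the filename and
header values are placeholders for whatever metadata you want to record:

wget --warc-file=donated-site \
 --warc-header="operator: Our Library" \
 --warc-header="description: capture of a donated WordPress site" \
 --recursive --level=0 --no-parent --page-requisites --convert-links URL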
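
For the spider-trap problem, --reject-regex (wget 1.14+) is matched against
the full URL, so something along these lines skips WordPress's
reply-to-comment links (the pattern is only an example):

wget --recursive --level=0 --no-parent --page-requisites --convert-links \
 --reject-regex='replytocom=' URL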

If the site is proving stubborn, I can take a look off-list.

Best of luck,
Alex

[1] I've used the following successfully:

wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" --warc-file=FILENAME \
 --warc-header=STRING --page-requisites -e robots=off --random-wait --wait=5 \
 --recursive --level=0 --no-parent --convert-links URL


Re: [CODE4LIB] wget archiving for dummies

2014-10-06 Thread Little, James Clarence IV
I love that user agent.

This is the wget command I've used to back up sites that have pretty URLs:

wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ URL


– Jamie



Re: [CODE4LIB] wget archiving for dummies

2014-10-06 Thread Alexander Duryee
I was dealing with a lot of sites that would shunt the user around based on
their user agent (e.g., very old sites that had completely different pages
for Netscape and IE), so I needed something neutral that wouldn't get
caught in a browser-specific branch.  Suffice it to say, nothing ever checks
for Amiga browsers :)




Re: [CODE4LIB] wget archiving for dummies

2014-10-06 Thread Eric Phetteplace
Thanks for the advice, all. I'm trying httrack now, but the other wget
options are good to know about, especially Alex's point about saving a WARC
file.

One clarification: I definitely don't want to deal with the database, nor
can I; we don't have admin or server access. Even if we did, I don't think
preserving the db would be wise or necessary.

Best,
Eric
