Re: [CODE4LIB] archiving web pages

2014-01-16 Thread Kari R Smith
As an archivist, I would suggest that rather than thinking up all the possible 
requirements, you check with your archives staff, your institutional records 
policy, and your archives collections policy to find out what their actual 
requirements are.  Having the full digital content as it was displayed is 
important for preservation.  As archivists, part of our job is to represent in 
description what the content is, how it was in the context of the time it was 
created and used, and what has been done to it to present it to users (over 
time).
 
The ad layout may be a different thing from the specific ads that were shown.  
Taking snapshots of particular ads may be different from having full dynamic 
reconstructions of websites.  Providing non-dynamic PDFs of webpages may not be 
the same as following the navigation pathways through a website.

Kari Smith

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@listserv.nd.edu] On Behalf Of 
Wilhelmina Randtke
Sent: Wednesday, January 15, 2014 10:29 AM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] archiving web pages

Agreed, don't focus too much on preserving the presentation for an online 
newspaper.  The text and images are important, but the layout isn't so 
important.

-Wilhelmina Randtke





Re: [CODE4LIB] archiving web pages

2014-01-15 Thread Stern, Randy
Here is another:
http://wax.lib.harvard.edu/collections/home.do

- Randy

--

Date:    Tue, 14 Jan 2014 10:43:18 -0700
From:    Robert Sanderson azarot...@gmail.com
Subject: Re: archiving web pages

Here are several to consider:

* http://www.webarchive.org.uk/wayback/archive/*/http://www.aboutmayfair.co.uk/
* http://webarchive.loc.gov/lcwa0015/*/http://lawprofessors.typepad.com/adminlaw/
* http://www.padi.cat:8080/wayback/*/http://www.ajberga.cat/
* http://vefsafn.is/index.php?page=english


Hope that helps :)

Rob






On Tue, Jan 14, 2014 at 10:31 AM, Nathan Tallman ntall...@gmail.com wrote:

 Lisa,

 Is your local web archive available online? I'd like to see a production
 example of non-Internet Archive instance of Wayback/Open Wayback.

 Thanks,
 Nathan


Re: [CODE4LIB] archiving web pages

2014-01-15 Thread Wilhelmina Randtke
Agreed, don't focus too much on preserving the presentation for an online
newspaper.  The text and images are important, but the layout isn't so
important.

-Wilhelmina Randtke


On Tue, Jan 14, 2014 at 10:59 AM, Kyle Banerjee kyle.baner...@gmail.com wrote:

 IMO, there are many web archiving situations where it is more appropriate
 to just focus on the content rather than the manifestation of the content.
 Just as you wouldn't expect a 1995 article from the NYT to be displayed as
 the website was in 1995 or an article in an online database to actually
 appear like it originally appeared online, it's the content rather than the
 skin that's relevant in the case of a newspaper. If you make sure it's in a
 format that can be migrated forward and added to standalone or union
 systems that provide access to this sort of stuff, you'll be fine.

 kyle





Re: [CODE4LIB] archiving web pages

2014-01-15 Thread Andrew Darby
If it's doable, I think preserving the whole enchilada is desirable.  For
instance, at my last library, there was a regular assignment where students
needed the print version of old periodicals because they were tasked with
analysing the ads and layouts.  Someone might be interested in web layouts
from the 2000s, and there might be content (again, ads, but also masthead
logos, ???) that might not otherwise be captured.

Andrew


On Wed, Jan 15, 2014 at 10:29 AM, Wilhelmina Randtke rand...@gmail.com wrote:

 Agreed, don't focus too much on preserving the presentation for an online
 newspaper.  The text and images are important, but the layout isn't so
 important.

 -Wilhelmina Randtke






-- 
Andrew Darby
Head, Web & Emerging Technologies
University of Miami Libraries


Re: [CODE4LIB] archiving web pages

2014-01-15 Thread Alexander Duryee
There's always the option of capturing a WARC of the newspaper as the
preservation master for dark storage, and generating PDFs for access via
your CMS.  If you're in CONTENTdm already, then a PDF would be much easier
to use (on both the back end and the front end).

The provenance metadata of WARC is too important not to capture, but I
agree that it can be awkward to use for access.  A hybrid approach of
generating WARCs and PDFs may be best - the PDF will handle most of your
use cases, and any further questions/issues (e.g. rendering questions,
research into interactive advertisements, etc.) can defer to the WARC.
I've used this approach elsewhere, and it was a relief to know that we
could always go back to a WARC file to resolve issues of
provenance/authenticity/content.
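
A minimal sketch of how the WARC master and the PDF access copy for one
capture might be tied together, assuming Python and a simple JSON manifest
with SHA-256 checksums. The function names, manifest fields, and file names
are illustrative assumptions, not part of CONTENTdm or of any WARC tool:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(issue_id, warc_path, pdf_path):
    """Pair a WARC preservation master with its PDF access derivative.

    The field names are hypothetical; adapt them to local metadata practice.
    """
    return {
        "issue": issue_id,
        "preservation_master": {
            "file": Path(warc_path).name,
            "sha256": sha256_of(warc_path),
        },
        "access_copy": {
            "file": Path(pdf_path).name,
            "sha256": sha256_of(pdf_path),
        },
    }


def manifest_json(manifest):
    """Serialize the manifest in a stable, human-readable form."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Keeping the checksums next to both files means a later question about the
PDF can be traced back to the exact WARC it was derived from.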

--Alex


On Wed, Jan 15, 2014 at 11:52 AM, Andrew Darby darby.li...@gmail.com wrote:

 If it's doable, I think preserving the whole enchilada is desirable.  For
 instance, at my last library, there was a regular assignment where students
 needed the print version of old periodicals because they were tasked with
 analysing the ads and layouts.  Someone might be interested in web layouts
 from the 2000s, and there might be content (again, ads, but also masthead
 logos, ???) that might not otherwise be captured.

 Andrew





Re: [CODE4LIB] archiving web pages

2014-01-15 Thread Kyle Banerjee
On Wed, Jan 15, 2014 at 8:52 AM, Andrew Darby darby.li...@gmail.com wrote:

 If it's doable, I think preserving the whole enchilada is desirable.  For
 instance, at my last library, there was a regular assignment where students
 needed the print version of old periodicals because they were tasked with
 analysing the ads and layouts.  Someone might be interested in web layouts
 from the 2000s, and there might be content (again, ads, but also masthead
 logos, ???) that might not otherwise be captured



That often is not possible, and the number of circumstances when it is will
only decrease over time. Except on flat sites designed according to a
physical document model, the platform and the content work together to
provide the experience. A reasonable argument can be made that taking
snapshots of dynamic things is lossier than focusing on the data. With
regard to the ads, what people see has varied dramatically based on a
number of factors for quite a while.

Even if that weren't true, retaining information just because some academic
could conceivably come up with a use for it is not a good reason to keep
it. Everything in your trash/recycling may be very interesting from an
archaeological point of view at some time, but it's still a good idea to
pitch it.   The shrinking role libraries play in the information sphere is
way too small for us to pay to maintain stuff that has no purpose beyond
meeting a use case that might exist at some indeterminate point in the
future -- especially given the high costs of maintained storage. Fear not.
We will leave no shortage of physical and virtual information about ourselves
to future generations.

kyle


Re: [CODE4LIB] archiving web pages

2014-01-15 Thread Nicholas Taylor
+1 to Alex's suggestion to use WARC for the preservation master and 
generate PDFs for access.


While I agree with Kyle that it's ultimately the content that's 
important and that hypothetical researcher needs are inexhaustible, I do 
think there's an advantage to preserving web content in a web-native 
way. Aside from verisimilitude, looking ahead to implementation of 
Memento (http://mementoweb.org/) - a mechanism for adding temporal 
navigation to the web through federated discovery of resources preserved 
in distributed web archives - data stored in WARC will ultimately be 
better integrated into the fabric of the web than PDFs siloed in an 
individual institutional repository.
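
For a rough idea of what Memento's temporal navigation looks like on the
wire: per RFC 7089, a client asks a TimeGate for a resource as of a given
moment by sending an Accept-Datetime request header. The sketch below,
assuming Python, only builds the URL and header and does not contact any
server. The TimeGate prefix in the usage note is a made-up placeholder, and
the helper name is hypothetical:

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request(timegate_prefix, original_url, when):
    """Return (url, headers) for a Memento TimeGate lookup.

    `when` must be a timezone-aware datetime; it is sent as an HTTP-date
    (always in GMT) in the Accept-Datetime header.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_prefix + original_url, {"Accept-Datetime": http_date}
```

For example, memento_request("http://archive.example.org/timegate/",
"http://yfile.news.yorku.ca/", some_datetime) gives a URL and header dict
that could be handed to any HTTP client; the archive would then redirect to
the capture closest to that time.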


I also wanted to mention (and encourage addition to!) the Wikipedia list 
of web archiving initiatives: 
http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives. It 
provides a good overview of many web archiving institutions' programs, 
data formats, technology stacks, and access provisions (including links 
to their Wayback implementations).


~Nicholas
--
Nicholas Taylor
Web Archiving Service Manager
Stanford University Libraries


Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Kyle Banerjee
IMO, there are many web archiving situations where it is more appropriate
to just focus on the content rather than the manifestation of the content.
Just as you wouldn't expect a 1995 article from the NYT to be displayed as
the website was in 1995 or an article in an online database to actually
appear like it originally appeared online, it's the content rather than the
skin that's relevant in the case of a newspaper. If you make sure it's in a
format that can be migrated forward and added to standalone or union
systems that provide access to this sort of stuff, you'll be fine.

kyle


On Tue, Jan 14, 2014 at 8:48 AM, Kathryn Frederick (Library) 
kfred...@skidmore.edu wrote:

 Hi,
 I'm trying to develop a strategy for preserving issues of our school's online
 newspaper. Creating a WARC file of the content seems straightforward, but
 how will that content fare long-term? Also, how is the WARC served to an
 end-user? Is there some other method I should look at?
 Thanks in advance for any advice!
 Kathryn



Re: [CODE4LIB] archiving web pages

2014-01-14 Thread L Snider
Hi Kathryn,

Right now the WARC format is considered the best preservation format for
websites/social media, in terms of digital archives. It is our best guess
right now. It will likely be with us for a long time, because it has
been adopted by most of the major players.
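
To give a sense of why the format is considered durable, here is a rough
sketch of what a single WARC/1.0 record looks like when written by hand: a
few plain-text headers, a blank line, then the captured bytes. This is a
simplified illustration assuming Python. Real crawlers such as Heritrix
write more record types (warcinfo, request, response) and more headers. The
function name and the example URL in the usage note are made up:

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, payload, content_type, captured_at):
    """Serialize one WARC/1.0 'resource' record as bytes.

    `payload` is the captured content as bytes; `captured_at` must be a
    timezone-aware datetime and is written in UTC.
    """
    # SHA-1 in Base32 is the conventional digest form in WARC files.
    digest = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {stamp}",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Payload-Digest: sha1:{digest}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Every record ends with two CRLF pairs after the content block.
    return head + payload + b"\r\n\r\n"
```

Calling it with, say, "http://news.example.edu/" and the bytes of a page
yields something you can read in a text editor, which is part of the appeal
for long-term preservation.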

The way I have seen WARCs served up is through Wayback, the locally
installed version of the Internet Archive's Wayback Machine.
http://archive-access.sourceforge.net/projects/wayback/index.html

I have only used Heritrix and Wayback together, so I haven't played with
Wayback and WARCs made another way.

I would stick with WARC in terms of preservation; access is another
story...that would depend on budget, time, etc.

Hope that helps.

Cheers

Lisa
-- 
Lisa Snider
Electronic Records Archivist
Harry Ransom Center
The University of Texas at Austin
P.O. Box 7219
Austin, Texas 78713-7219
P: 512-232-4616
www.hrc.utexas.edu



On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) 
kfred...@skidmore.edu wrote:

 Hi,
 I'm trying to develop a strategy for preserving issues of our school's online
 newspaper. Creating a WARC file of the content seems straightforward, but
 how will that content fare long-term? Also, how is the WARC served to an
 end-user? Is there some other method I should look at?
 Thanks in advance for any advice!
 Kathryn



Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Robert Sanderson
For what it's worth, the latest wayback code is:

https://github.com/iipc/openwayback

It is being developed by the IIPC, rather than by the Internet Archive
alone, and has many additional features contributed by other members.

It should be used in preference to the SourceForge version, IMO.

Rob




On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote:

 The way I have seen WARCs served up is through Wayback, the manual version
 of the Internet Archive's Wayback machine.
 http://archive-access.sourceforge.net/projects/wayback/index.html




Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Francis Kayiwa

On 1/14/2014 11:48 AM, Kathryn Frederick (Library) wrote:

Hi,
I'm trying to develop a strategy for preserving issues of our school's online 
newspaper. Creating a WARC file of the content seems straightforward, but how 
will that content fare long-term? Also, how is the WARC served to an end-user? 
Is there some other method I should look at?
Thanks in advance for any advice!


WARC's good, but I feel you are asking two questions when you add how the 
WARC will be rendered. (Apologies if I'm not grokking your meaning.)


If Skidmore has an IR, I'd look into adding them to your IR and rendering 
from there (in addition to WARC'ing them).


Cheers,
./fxk

--
Cheap things are of no value, valuable things are not cheap.


Re: [CODE4LIB] archiving web pages

2014-01-14 Thread L Snider
Rob is right on! I included the wrong link, thanks for catching that...

Cheers

Lisa


On Tue, Jan 14, 2014 at 11:04 AM, Robert Sanderson azarot...@gmail.com wrote:

 For what it's worth, the latest wayback code is:

 https://github.com/iipc/openwayback

 And being developed by the IIPC consortium, rather than just the Internet
 Archive alone.
 It has many additional features, contributed by other members.

 It should be used in preference to the sourceforge version, IMO.

 Rob







Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Nathan Tallman
On Tue, Jan 14, 2014 at 12:08 PM, Francis Kayiwa fkay...@colgate.edu wrote:


 If Skidmore has an IR I'd look into adding them into your IR and render
 from there (in addition to WARC'ing them)



Francis, I'm confused when you say "in addition to WARC'ing them." Wouldn't
you be putting the WARC into the IR and using it to render? Or are you
advocating that a format other than WARC should go into the IR?

Thanks,
Nathan


Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Nathan Tallman
Lisa,

Is your local web archive available online? I'd like to see a production
example of a non-Internet Archive instance of Wayback/OpenWayback.

Thanks,
Nathan


On Tue, Jan 14, 2014 at 12:17 PM, L Snider lsni...@gmail.com wrote:

 Rob is right on! I included the wrong link, thanks for catching that...

 Cheers

 Lisa





Re: [CODE4LIB] archiving web pages

2014-01-14 Thread L Snider
Hi Nathan,

Nope, unfortunately not...It was done as a test, and at that time we used
the IA-only version.

Cheers

Lisa


On Tue, Jan 14, 2014 at 11:31 AM, Nathan Tallman ntall...@gmail.com wrote:

 Lisa,

 Is your local web archive available online? I'd like to see a production
 example of non-Internet Archive instance of Wayback/Open Wayback.

 Thanks,
 Nathan





Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Robert Sanderson
Here are several to consider:

* http://www.webarchive.org.uk/wayback/archive/*/http://www.aboutmayfair.co.uk/
* http://webarchive.loc.gov/lcwa0015/*/http://lawprofessors.typepad.com/adminlaw/
* http://www.padi.cat:8080/wayback/*/http://www.ajberga.cat/
* http://vefsafn.is/index.php?page=english


Hope that helps :)

Rob






On Tue, Jan 14, 2014 at 10:31 AM, Nathan Tallman ntall...@gmail.com wrote:

 Lisa,

 Is your local web archive available online? I'd like to see a production
 example of non-Internet Archive instance of Wayback/Open Wayback.

 Thanks,
 Nathan




Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Nick Ruest

Hi-

We have actually implemented an answer to the original question above, with 
some shell scripts[1] for harvesting and creating SIPs. The SIPs are then 
ingested into our Islandora instance with the Web ARChive Solution Pack[2] as 
AIPs. DIPs are also available via our local Wayback instance[3], and on 
a given object page.


For example, here is the crawl of YFile from December 26, 2013 in 
Islandora[4] with associated derivatives, and here it is rendered in our 
local Wayback[5].
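
The Wayback link in [5] follows the usual replay URL pattern: the Wayback
prefix, a 14-digit capture timestamp (YYYYMMDDhhmmss), then the original
URL. A small sketch, assuming Python, of building and taking apart such
URLs; the helper names are hypothetical and not part of Wayback itself:

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit Wayback capture timestamp


def wayback_url(prefix, captured_at, original_url):
    """Build a replay URL: <prefix>/<timestamp>/<original URL>."""
    stamp = captured_at.strftime(STAMP_FORMAT)
    return f"{prefix.rstrip('/')}/{stamp}/{original_url}"


def parse_wayback_url(prefix, url):
    """Split a replay URL into (capture datetime, original URL)."""
    base = prefix.rstrip("/") + "/"
    if not url.startswith(base):
        raise ValueError("URL does not belong to this Wayback instance")
    stamp, original_url = url[len(base):].split("/", 1)
    return datetime.strptime(stamp, STAMP_FORMAT), original_url
```

With the prefix http://digital.library.yorku.ca/wayback, the capture time
2013-12-26 05:30:32, and http://yfile.news.yorku.ca/ as the original URL,
wayback_url reproduces the link in [5].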


If you're curious about the Islandora Web ARChive Solution Pack, I have 
written up a couple posts on it[6][7].


...and as always, if you notice that I'm doing something wrong, let me 
know, or fork and contribute!


cheers!

-nruest

[1] https://github.com/yorkulibraries/yudl-web-archiving
[2] https://github.com/Islandora/islandora_solution_pack_web_archive
[3] http://digital.library.yorku.ca/wayback
[4] http://digital.library.yorku.ca/yul-113521/yfile-2013-12-26
[5] http://digital.library.yorku.ca/wayback/20131226053032/http://yfile.news.yorku.ca/
[6] http://ruebot.net/content/islandora-web-archive-solution-pack-open-repositories-2013
[7] http://ruebot.net/post/islandora-web-archive-sp-updates





Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Kari R Smith
Kathryn,
When you write "strategy," do you mean a technology solution or a preservation 
strategy, one component of which is the technology implementation of said 
strategy?  If it's a preservation strategy for your school's online (web) 
content - so archival records - see what the University of Michigan's Bentley 
Library has to offer in terms of written strategies and plans for web archiving 
of University web-based content.

Kari

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@listserv.nd.edu] On Behalf Of Kathryn 
Frederick (Library)
Sent: Tuesday, January 14, 2014 11:49 AM
To: CODE4LIB@listserv.nd.edu
Subject: [CODE4LIB] archiving web pages

Hi,
I'm trying to develop a strategy for preserving issues of our school's online 
newspaper. Creating a WARC file of the content seems straightforward, but how 
will that content fare long-term? Also, how is the WARC served to an end-user? 
Is there some other method I should look at?
Thanks in advance for any advice!
Kathryn


Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Francis Kayiwa

On 1/14/2014 12:26 PM, Nathan Tallman wrote:

On Tue, Jan 14, 2014 at 12:08 PM, Francis Kayiwa fkay...@colgate.edu wrote:



If Skidmore has an IR I'd look into adding them into your IR and render
from there (in addition to WARC'ing them)




Francis, I'm confused when you say "in addition to WARC'ing them." Wouldn't
you be putting the WARC into the IR and using it to render? Or are you
advocating that a format other than WARC should go into the IR?


I initially meant the latter, but now that you've questioned my 
thinking, I've revised it ;-)



./fxk


--
Cheap things are of no value, valuable things are not cheap.


Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Kathryn Frederick (Library)
Thanks for the thoughtful responses. We've been actively digitizing our print 
paper (which ceased publication in 2011) and I was thinking of this as an 
extension of that effort. Right now, I think capturing a monthly WARC file of 
the site is definitely a good idea no matter what. But beyond that, as Kyle 
pointed out, it's not really the web site I'm after but the content. I'd like 
to present this content alongside print issues in our IR (currently CONTENTdm). 
In one sense, I can see doing a weekly capture of the site which would equate 
to an issue in the old format. But, I could also do a PDF of the content. A PDF 
makes sense to me in the context of a collection that is largely print-based 
and gets at what I want (keyword searchable content, authors, dates), but is it 
disingenuous to fundamentally alter the format? Plus there's the work 
involved... This may be a question for archivists, but I'm not one so would 
appreciate any additional thoughts from this group. 
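
One possible way to script the monthly (or weekly) WARC capture, sketched
here in Python. It assumes GNU Wget 1.14 or later, which can write WARC
output via --warc-file; nobody in this thread has said they use Wget, so
treat the tool choice as an assumption (Heritrix is the crawler mentioned
elsewhere). The site URL in the usage note and the file-name prefix are
placeholders:

```python
import subprocess
from datetime import date


def wget_warc_command(site_url, capture_date, out_prefix="newspaper"):
    """Build a wget command that mirrors a site into a dated WARC file."""
    warc_name = f"{out_prefix}-{capture_date.strftime('%Y%m%d')}"
    return [
        "wget",
        "--mirror",                   # recurse through the whole site
        "--page-requisites",          # also fetch images, CSS, and scripts
        "--wait=1",                   # pause between requests to be polite
        f"--warc-file={warc_name}",   # wget appends .warc.gz to this name
        "--warc-cdx",                 # write a CDX index alongside the WARC
        site_url,
    ]


def run_capture(site_url, capture_date):
    """Run the capture; requires wget to be installed and on the PATH."""
    return subprocess.run(wget_warc_command(site_url, capture_date), check=False)
```

Running run_capture("http://news.example.edu/", date.today()) from a
scheduled job once a month would leave one dated WARC per capture, which
could later map onto "issues" if a weekly schedule is used instead.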




Re: [CODE4LIB] archiving web pages

2014-01-14 Thread L Snider
As an archivist, I don't see any problem using a PDF. Technically it should
be a PDF/A, but realistically it is usually a plain PDF.

I have done projects where I used PDFs for the archiving of full websites.
It can be quite handy, depending on needs of course. Sometimes it works
with the look and feel/design, and sometimes it doesn't. Content is pretty
good usually, in my experience.

Do a test and see whether your site crashes your Adobe product...sometimes
the code, special effects or just size can crash it without a PDF being
made...Also look at how many levels of the site you want captured; that can
cause a mess too.

Cheers

Lisa

-- 
Lisa Snider
Electronic Records Archivist
Harry Ransom Center
The University of Texas at Austin
P.O. Box 7219
Austin, Texas 78713-7219
P: 512-232-4616
www.hrc.utexas.edu



On Tue, Jan 14, 2014 at 12:48 PM, Kathryn Frederick (Library) 
kfred...@skidmore.edu wrote:

 Thanks for the thoughtful responses. We've been actively digitizing our
 print paper (which ceased publication in 2011) and I was thinking of this
 as an extension of that effort. Right now, I think capturing a monthly WARC
 file of the site is definitely a good idea no matter what. But beyond that,
 as Kyle pointed out, it's not really the web site I'm after but the
 content. I'd like to present this content alongside print issues in our IR
 (currently ContentDM). In one sense, I can see doing a weekly capture of
 the site which would equate to an issue in the old format. But, I could
 also do a PDF of the content. A PDF makes sense to me in the context of a
 collection that is largely print-based and gets at what I want (keyword
 searchable content, authors, dates), but is it disingenuous to
 fundamentally alter the format? Plus there's the work involved... This may
 be a question for archivists, but I'm not one so would appreciate any
 additional thoughts from this group.
