Re: [CODE4LIB] archiving web pages
As an archivist, I would suggest that rather than thinking up all the possible requirements, you check with your archives staff, your institutional records policy, and your archives collections policy to find out what their actual requirements are. Having the full digital content as it was displayed is important for preservation. As archivists, part of our job is to represent in description what the content is, how it was in the context of the time it was created and used, and what has been done to it to present it to users (over time). Ad layout may be different from what the specific ads were. Taking snapshots of the particular ads may be different from having full dynamic reconstructions of websites. Providing non-dynamic PDFs of webpages may not be the same as following the navigation pathways through a website.

Kari Smith

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@listserv.nd.edu] On Behalf Of Wilhelmina Randtke
Sent: Wednesday, January 15, 2014 10:29 AM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] archiving web pages

Agreed, don't focus too much on preserving the presentation for an online newspaper. The text and images are important, but the layout isn't so important. -Wilhelmina Randtke

On Tue, Jan 14, 2014 at 10:59 AM, Kyle Banerjee kyle.baner...@gmail.com wrote:

IMO, there are many web archiving situations where it is more appropriate to just focus on the content rather than the manifestation of the content. Just as you wouldn't expect a 1995 article from the NYT to be displayed as the website was in 1995 or an article in an online database to actually appear like it originally appeared online, it's the content rather than the skin that's relevant in the case of a newspaper. If you make sure it's in a format that can be migrated forward and added to standalone or union systems that provide access to this sort of stuff, you'll be fine.
kyle On Tue, Jan 14, 2014 at 8:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
Here is another: http://wax.lib.harvard.edu/collections/home.do

- Randy

--
Date: Tue, 14 Jan 2014 10:43:18 -0700
From: Robert Sanderson azarot...@gmail.com
Subject: Re: archiving web pages

Here are several to consider:

* http://www.webarchive.org.uk/wayback/archive/*/http://www.aboutmayfair.co.uk/
* http://webarchive.loc.gov/lcwa0015/*/http://lawprofessors.typepad.com/adminlaw/
* http://www.padi.cat:8080/wayback/*/http://www.ajberga.cat/
* http://vefsafn.is/index.php?page=english

Hope that helps :)

Rob

On Tue, Jan 14, 2014 at 10:31 AM, Nathan Tallman ntall...@gmail.com wrote:

Lisa, Is your local web archive available online? I'd like to see a production example of a non-Internet Archive instance of Wayback/Open Wayback. Thanks, Nathan
Re: [CODE4LIB] archiving web pages
Agreed, don't focus too much on preserving the presentation for an online newspaper. The text and images are important, but the layout isn't so important. -Wilhelmina Randtke On Tue, Jan 14, 2014 at 10:59 AM, Kyle Banerjee kyle.baner...@gmail.comwrote: IMO, there are many web archiving situations where it is more appropriate to just focus on the content rather than the manifestation of the content. Just as you wouldn't expect a 1995 article from the NYT to be displayed as the website was in 1995 or an article in an online database to actually appear like it originally appeared online, it's the content rather than the skin that's relevant in the case of a newspaper. If you make sure it's in a format that can be migrated forward and added to standalone or union systems that provide access to this sort of stuff, you'll be fine. kyle On Tue, Jan 14, 2014 at 8:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
If it's doable, I think preserving the whole enchilada is desirable. For instance, at my last library, there was a regular assignment where students needed the print version of old periodicals because they were tasked with analysing the ads and layouts. Someone might be interested in web layouts from the 2000s, and there might be content (again, ads, but also masthead logos, ???) that might not otherwise be captured. Andrew On Wed, Jan 15, 2014 at 10:29 AM, Wilhelmina Randtke rand...@gmail.comwrote: Agreed, don't focus too much on preserving the presentation for an online newspaper. The text and images are important, but the layout isn't so important. -Wilhelmina Randtke On Tue, Jan 14, 2014 at 10:59 AM, Kyle Banerjee kyle.baner...@gmail.com wrote: IMO, there are many web archiving situations where it is more appropriate to just focus on the content rather than the manifestation of the content. Just as you wouldn't expect a 1995 article from the NYT to be displayed as the website was in 1995 or an article in an online database to actually appear like it originally appeared online, it's the content rather than the skin that's relevant in the case of a newspaper. If you make sure it's in a format that can be migrated forward and added to standalone or union systems that provide access to this sort of stuff, you'll be fine. kyle On Tue, Jan 14, 2014 at 8:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn -- Andrew Darby Head, Web Emerging Technologies University of Miami Libraries
Re: [CODE4LIB] archiving web pages
There's always the option of capturing a WARC of the newspaper as the preservation master for dark storage, and generating PDFs for access via your CMS. If you're in ContentDM already, then a PDF would be much easier to use (both on the back and frontends). The provenance metadata of WARC is too important not to capture, but I agree that it can be awkward to use for access. A hybrid approach of generating WARCs and PDFs may be best - the PDF will handle most of your use cases, and any further questions/issues (e.g. rendering questions, research into interactive advertisements, etc.) can defer to the WARC. I've used this approach elsewhere, and it was a relief to know that we could always go back to a WARC file to resolve issues of provenance/authenticity/content. --Alex On Wed, Jan 15, 2014 at 11:52 AM, Andrew Darby darby.li...@gmail.comwrote: If it's doable, I think preserving the whole enchilada is desirable. For instance, at my last library, there was a regular assignment where students needed the print version of old periodicals because they were tasked with analysing the ads and layouts. Someone might be interested in web layouts from the 2000s, and there might be content (again, ads, but also masthead logos, ???) that might not otherwise be captured. Andrew On Wed, Jan 15, 2014 at 10:29 AM, Wilhelmina Randtke rand...@gmail.com wrote: Agreed, don't focus too much on preserving the presentation for an online newspaper. The text and images are important, but the layout isn't so important. -Wilhelmina Randtke On Tue, Jan 14, 2014 at 10:59 AM, Kyle Banerjee kyle.baner...@gmail.com wrote: IMO, there are many web archiving situations where it is more appropriate to just focus on the content rather than the manifestation of the content. 
Just as you wouldn't expect a 1995 article from the NYT to be displayed as the website was in 1995 or an article in an online database to actually appear like it originally appeared online, it's the content rather than the skin that's relevant in the case of a newspaper. If you make sure it's in a format that can be migrated forward and added to standalone or union systems that provide access to this sort of stuff, you'll be fine. kyle On Tue, Jan 14, 2014 at 8:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn -- Andrew Darby Head, Web Emerging Technologies University of Miami Libraries
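Alex's hybrid approach (a WARC as the preservation master, a PDF as the access copy) could be tracked with a small record that ties the two files together. The following is only a sketch under stated assumptions: the `CapturePair` and `describe_capture` names are hypothetical, and recording SHA-256 checksums is an added assumption about how one might later confirm that the master in dark storage is unchanged, not something described in the thread.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class CapturePair:
    """Hypothetical record linking one crawl's WARC master to its access PDF."""
    capture_date: str        # e.g. "2014-01-15"
    warc_master: Path        # preservation master kept in dark storage
    warc_sha256: str         # checksum of the master (assumed fixity practice)
    access_pdf: Path         # derivative loaded into the access system
    access_pdf_sha256: str   # checksum of the derivative


def describe_capture(capture_date: str, warc_master: Path, access_pdf: Path) -> CapturePair:
    """Build the pairing record for one capture, computing both checksums."""
    return CapturePair(
        capture_date=capture_date,
        warc_master=warc_master,
        warc_sha256=sha256_of(warc_master),
        access_pdf=access_pdf,
        access_pdf_sha256=sha256_of(access_pdf),
    )
```

With a record like this, a question about what the PDF left out can be traced back to the exact WARC it was derived from.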
Re: [CODE4LIB] archiving web pages
On Wed, Jan 15, 2014 at 8:52 AM, Andrew Darby darby.li...@gmail.com wrote:

If it's doable, I think preserving the whole enchilada is desirable. For instance, at my last library, there was a regular assignment where students needed the print version of old periodicals because they were tasked with analysing the ads and layouts. Someone might be interested in web layouts from the 2000s, and there might be content (again, ads, but also masthead logos, ???) that might not otherwise be captured.

That often is not possible, and the number of circumstances in which it is possible will only decrease over time. Except on flat sites designed according to a physical document model, the platform and the content work together to provide the experience. A reasonable argument can be made that taking snapshots of dynamic things is lossier than focusing on the data.

With regard to the ads, what people see has varied dramatically based on a number of factors for quite a while. Even if that weren't true, the possibility that some academic could conceivably come up with a use for information is not a good reason to keep it. Everything in your trash/recycling may be very interesting from an archaeological point of view at some time, but it's still a good idea to pitch it.

The shrinking role libraries play in the information sphere is way too small for us to pay to maintain stuff that has no purpose beyond meeting a use case that might exist at some indeterminate point in the future -- especially given the high costs of maintained storage.

Fear not. We will leave no shortage of physical and virtual information about ourselves to future generations.

kyle
Re: [CODE4LIB] archiving web pages
+1 to Alex's suggestion to use WARC for the preservation master and generate PDFs for access. While I agree with Kyle that it's ultimately the content that's important and that hypothetical researcher needs are inexhaustible, I do think there's an advantage to preserving web content in a web-native way. Aside from verisimilitude, looking ahead to implementation of Memento (http://mementoweb.org/) - a mechanism for adding temporal navigation to the web through federated discovery of resources preserved in distributed web archives - data stored in WARC will ultimately be better integrated into the fabric of the web than PDFs siloed in an individual institutional repository. I also wanted to mention (and encourage addition to!) the Wikipedia list of web archiving initiatives: http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives. It provides a good overview of many web archiving institutions' programs, data formats, technology stacks, and access provisions (including links to their Wayback implementations). ~Nicholas -- Nicholas Taylor Web Archiving Service Manager Stanford University Libraries
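As a rough illustration of the temporal navigation Memento adds: a client asks a TimeGate for the capture closest to a given moment by sending an `Accept-Datetime` request header containing an HTTP date. The sketch below only builds that header; the function name is hypothetical, no particular TimeGate address is assumed, and actually sending the request is left to whatever HTTP client you use.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def accept_datetime_headers(moment: datetime) -> Dict[str, str]:
    """Build the request header asking a Memento TimeGate for the
    capture nearest to `moment` (which must be timezone-aware)."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # The header value is an HTTP date expressed in GMT.
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


if __name__ == "__main__":
    wanted = datetime(2013, 12, 26, 5, 30, 32, tzinfo=timezone.utc)
    print(accept_datetime_headers(wanted))
```

A TimeGate that understands the header would normally answer with a redirect to the archived copy, which is what lets a WARC-backed archive take part in cross-archive discovery.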
Re: [CODE4LIB] archiving web pages
IMO, there are many web archiving situations where it is more appropriate to just focus on the content rather than the manifestation of the content. Just as you wouldn't expect a 1995 article from the NYT to be displayed as the website was in 1995 or an article in an online database to actually appear like it originally appeared online, it's the content rather than the skin that's relevant in the case of a newspaper. If you make sure it's in a format that can be migrated forward and added to standalone or union systems that provide access to this sort of stuff, you'll be fine. kyle On Tue, Jan 14, 2014 at 8:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
Hi Kathryn,

Right now the WARC format is considered the best preservation format for websites/social media, in terms of digital archives. It is our best guess right now. It will likely be with us for a long time, because it has been adopted by most of the major players.

The way I have seen WARCs served up is through Wayback, the locally installed version of the Internet Archive's Wayback Machine. http://archive-access.sourceforge.net/projects/wayback/index.html

I have only used Heritrix and Wayback together, so I haven't played with Wayback and WARCs made another way. I would stick with WARC in terms of preservation; access is another story... that would depend on budget, time, etc.

Hope that helps.

Cheers
Lisa

--
Lisa Snider
Electronic Records Archivist
Harry Ransom Center
The University of Texas at Austin
P.O. Box 7219
Austin, Texas 78713-7219
P: 512-232-4616
www.hrc.utexas.edu

On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote:

Hi, I'm trying to develop a strategy for preserving issues of our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fare long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
For what it's worth, the latest Wayback code is at https://github.com/iipc/openwayback and is being developed by the IIPC, rather than by the Internet Archive alone. It has many additional features contributed by other members. It should be used in preference to the SourceForge version, IMO.

Rob

On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote:

Hi Kathryn, Right now the WARC format is considered the best preservation format for websites/social media, in terms of digital archives. It is our best guess right now. It will likely be with us for a long time, because it has been adopted by most of the major players. The way I have seen WARCs served up is through Wayback, the manual version of the Internet Archive's Wayback machine. http://archive-access.sourceforge.net/projects/wayback/index.html I have only used Heritrix and Wayback together, so I haven't played with Wayback and WARCs made another way. I would stick with WARC in terms of preservation, access is another story...that would depend on budget, time, etc. Hope that helps. Cheers Lisa -- Lisa Snider Electronic Records Archivist Harry Ransom Center The University of Texas at Austin P.O. Box 7219 Austin, Texas 78713-7219 P: 512-232-4616 www.hrc.utexas.edu

On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote:

Hi, I'm trying to develop a strategy for preserving issues of our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fare long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
On 1/14/2014 11:48 AM, Kathryn Frederick (Library) wrote:

Hi, I'm trying to develop a strategy for preserving issues of our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fare long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice!

WARC's good, but I feel you are asking two questions when you add how you will render using WARC (apologies if I'm not grokking your meaning). If Skidmore has an IR, I'd look into adding them to your IR and rendering from there (in addition to WARC'ing them).

Cheers,
./fxk

--
Cheap things are of no value, valuable things are not cheap.
Re: [CODE4LIB] archiving web pages
Rob is right on! I included the wrong link, thanks for catching that... Cheers Lisa On Tue, Jan 14, 2014 at 11:04 AM, Robert Sanderson azarot...@gmail.comwrote: For what it's worth, the latest wayback code is: https://github.com/iipc/openwayback And being developed by the IIPC consortium, rather than just the Internet Archive alone. It has many additional features, contributed by other members. It should be used in preference to the sourceforge version, IMO. Rob On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote: Hi Kathryn, Right now the WARC format is considered the best preservation format for websites/social media, in terms of digital archives. It is our best guess right now. It will likely will be with us for a long time, because it has been adopted by most of the major players. The way I have seen WARCs served up is through Wayback, the manual version of the Internet Archive's Wayback machine. http://archive-access.sourceforge.net/projects/wayback/index.html I have only used Heritrix and Wayback together, so I haven't played with Wayback and WARCs made another way. I would stick with WARC in terms of preservation, access is another story...that would depend on budget, time, etc. Hope that helps. Cheers Lisa -- Lisa Snider Electronic Records Archivist Harry Ransom Center The University of Texas at Austin P.O. Box 7219 Austin, Texas 78713-7219 P: 512-232-4616 www.hrc.utexas.edu On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
On Tue, Jan 14, 2014 at 12:08 PM, Francis Kayiwa fkay...@colgate.eduwrote: If Skidmore has an IR I'd looking into adding them into your IR and render from there (in addition to WARC'ing them) Francis, I'm confused when you say in addition to WARC'ing them. Wouldn't you be putting the WARC into the IR and using it to render? Or are you advocating that a format other than WARC should go into the IR? Thanks, Nathan
Re: [CODE4LIB] archiving web pages
Lisa, Is your local web archive available online? I'd like to see a production example of non-Internet Archive instance of Wayback/Open Wayback. Thanks, Nathan On Tue, Jan 14, 2014 at 12:17 PM, L Snider lsni...@gmail.com wrote: Rob is right on! I included the wrong link, thanks for catching that... Cheers Lisa On Tue, Jan 14, 2014 at 11:04 AM, Robert Sanderson azarot...@gmail.com wrote: For what it's worth, the latest wayback code is: https://github.com/iipc/openwayback And being developed by the IIPC consortium, rather than just the Internet Archive alone. It has many additional features, contributed by other members. It should be used in preference to the sourceforge version, IMO. Rob On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote: Hi Kathryn, Right now the WARC format is considered the best preservation format for websites/social media, in terms of digital archives. It is our best guess right now. It will likely will be with us for a long time, because it has been adopted by most of the major players. The way I have seen WARCs served up is through Wayback, the manual version of the Internet Archive's Wayback machine. http://archive-access.sourceforge.net/projects/wayback/index.html I have only used Heritrix and Wayback together, so I haven't played with Wayback and WARCs made another way. I would stick with WARC in terms of preservation, access is another story...that would depend on budget, time, etc. Hope that helps. Cheers Lisa -- Lisa Snider Electronic Records Archivist Harry Ransom Center The University of Texas at Austin P.O. Box 7219 Austin, Texas 78713-7219 P: 512-232-4616 www.hrc.utexas.edu On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? 
Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
Hi Nathan, Nope, unfortunately not...It was done as a test, and at that time we used the IA only version. Cheers Lisa On Tue, Jan 14, 2014 at 11:31 AM, Nathan Tallman ntall...@gmail.com wrote: Lisa, Is your local web archive available online? I'd like to see a production example of non-Internet Archive instance of Wayback/Open Wayback. Thanks, Nathan On Tue, Jan 14, 2014 at 12:17 PM, L Snider lsni...@gmail.com wrote: Rob is right on! I included the wrong link, thanks for catching that... Cheers Lisa On Tue, Jan 14, 2014 at 11:04 AM, Robert Sanderson azarot...@gmail.com wrote: For what it's worth, the latest wayback code is: https://github.com/iipc/openwayback And being developed by the IIPC consortium, rather than just the Internet Archive alone. It has many additional features, contributed by other members. It should be used in preference to the sourceforge version, IMO. Rob On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote: Hi Kathryn, Right now the WARC format is considered the best preservation format for websites/social media, in terms of digital archives. It is our best guess right now. It will likely will be with us for a long time, because it has been adopted by most of the major players. The way I have seen WARCs served up is through Wayback, the manual version of the Internet Archive's Wayback machine. http://archive-access.sourceforge.net/projects/wayback/index.html I have only used Heritrix and Wayback together, so I haven't played with Wayback and WARCs made another way. I would stick with WARC in terms of preservation, access is another story...that would depend on budget, time, etc. Hope that helps. Cheers Lisa -- Lisa Snider Electronic Records Archivist Harry Ransom Center The University of Texas at Austin P.O. 
Box 7219 Austin, Texas 78713-7219 P: 512-232-4616 www.hrc.utexas.edu On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
Here are several to consider:

* http://www.webarchive.org.uk/wayback/archive/*/http://www.aboutmayfair.co.uk/
* http://webarchive.loc.gov/lcwa0015/*/http://lawprofessors.typepad.com/adminlaw/
* http://www.padi.cat:8080/wayback/*/http://www.ajberga.cat/
* http://vefsafn.is/index.php?page=english

Hope that helps :)

Rob

On Tue, Jan 14, 2014 at 10:31 AM, Nathan Tallman ntall...@gmail.com wrote:

Lisa, Is your local web archive available online? I'd like to see a production example of a non-Internet Archive instance of Wayback/Open Wayback. Thanks, Nathan

On Tue, Jan 14, 2014 at 12:17 PM, L Snider lsni...@gmail.com wrote:

Rob is right on! I included the wrong link, thanks for catching that... Cheers Lisa

On Tue, Jan 14, 2014 at 11:04 AM, Robert Sanderson azarot...@gmail.com wrote:

For what it's worth, the latest wayback code is: https://github.com/iipc/openwayback And being developed by the IIPC consortium, rather than just the Internet Archive alone. It has many additional features, contributed by other members. It should be used in preference to the sourceforge version, IMO. Rob

On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote:

Hi Kathryn, Right now the WARC format is considered the best preservation format for websites/social media, in terms of digital archives. It is our best guess right now. It will likely be with us for a long time, because it has been adopted by most of the major players. The way I have seen WARCs served up is through Wayback, the manual version of the Internet Archive's Wayback machine. http://archive-access.sourceforge.net/projects/wayback/index.html I have only used Heritrix and Wayback together, so I haven't played with Wayback and WARCs made another way. I would stick with WARC in terms of preservation, access is another story...that would depend on budget, time, etc. Hope that helps. Cheers Lisa -- Lisa Snider Electronic Records Archivist Harry Ransom Center The University of Texas at Austin P.O.
Box 7219 Austin, Texas 78713-7219 P: 512-232-4616 www.hrc.utexas.edu On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
Hi-

We have actually implemented what the original question above asks about, with some shell scripts[1] for harvesting and creating SIPs. The SIPs are then ingested into our Islandora instance with the Web ARChive Solution Pack[2] as AIPs. DIPs are also available via our local Wayback instance[3], and on any given object page. For example, here is the crawl of YFile from December 26, 2013 in Islandora[4] with associated derivatives, and here it is rendered in our local Wayback[5].

If you're curious about the Islandora Web ARChive Solution Pack, I have written up a couple of posts on it[6][7].

...and as always, if you notice that I'm doing something wrong, let me know, or fork and contribute!

cheers!
-nruest

[1] https://github.com/yorkulibraries/yudl-web-archiving
[2] https://github.com/Islandora/islandora_solution_pack_web_archive
[3] http://digital.library.yorku.ca/wayback
[4] http://digital.library.yorku.ca/yul-113521/yfile-2013-12-26
[5] http://digital.library.yorku.ca/wayback/20131226053032/http://yfile.news.yorku.ca/
[6] http://ruebot.net/content/islandora-web-archive-solution-pack-open-repositories-2013
[7] http://ruebot.net/post/islandora-web-archive-sp-updates

On 14-01-14 12:26 PM, Nathan Tallman wrote:

On Tue, Jan 14, 2014 at 12:08 PM, Francis Kayiwa fkay...@colgate.edu wrote:

If Skidmore has an IR I'd look into adding them into your IR and render from there (in addition to WARC'ing them)

Francis, I'm confused when you say in addition to WARC'ing them. Wouldn't you be putting the WARC into the IR and using it to render? Or are you advocating that a format other than WARC should go into the IR? Thanks, Nathan
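The local Wayback link in Nick's message shows the usual replay URL shape: the Wayback prefix, a 14-digit capture timestamp (YYYYMMDDhhmmss), then the original URL. A minimal sketch of building and taking apart such URLs follows; the function names are hypothetical, and treating the timestamp as UTC is an assumption worth checking against your own Wayback deployment.

```python
from datetime import datetime, timezone
from typing import Tuple

# 14-digit capture timestamp as it appears in replay URLs: YYYYMMDDhhmmss
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def build_replay_url(wayback_prefix: str, captured_at: datetime, original_url: str) -> str:
    """Join prefix, capture timestamp and original URL into a replay URL.

    `captured_at` must be timezone-aware; it is converted to UTC (assumed).
    """
    if captured_at.tzinfo is None:
        raise ValueError("captured_at must be timezone-aware")
    stamp = captured_at.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"{wayback_prefix.rstrip('/')}/{stamp}/{original_url}"


def split_replay_url(wayback_prefix: str, replay_url: str) -> Tuple[datetime, str]:
    """Recover the capture time and the original URL from a replay URL."""
    prefix = wayback_prefix.rstrip("/") + "/"
    if not replay_url.startswith(prefix):
        raise ValueError("replay URL does not start with the given Wayback prefix")
    # Everything up to the first slash after the prefix is the timestamp.
    stamp, _, original_url = replay_url[len(prefix):].partition("/")
    captured_at = datetime.strptime(stamp, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return captured_at, original_url


if __name__ == "__main__":
    prefix = "http://digital.library.yorku.ca/wayback"
    when = datetime(2013, 12, 26, 5, 30, 32, tzinfo=timezone.utc)
    print(build_replay_url(prefix, when, "http://yfile.news.yorku.ca/"))
```

Because the capture time is carried in the URL itself, a stable link to one specific crawl can be recorded alongside the corresponding object in the repository.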
Re: [CODE4LIB] archiving web pages
Kathryn,

When you write "strategy", do you mean a technology solution, or a preservation strategy, one component of which is the technology implementation of said strategy? If it's a preservation strategy for your school's online (web) content - so archival records - see what the University of Michigan's Bentley Library has to offer in terms of written strategies and plans for web archiving of University web-based content.

Kari

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@listserv.nd.edu] On Behalf Of Kathryn Frederick (Library)
Sent: Tuesday, January 14, 2014 11:49 AM
To: CODE4LIB@listserv.nd.edu
Subject: [CODE4LIB] archiving web pages

Hi, I'm trying to develop a strategy for preserving issues of our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fare long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
On 1/14/2014 12:26 PM, Nathan Tallman wrote:

On Tue, Jan 14, 2014 at 12:08 PM, Francis Kayiwa fkay...@colgate.edu wrote:

If Skidmore has an IR I'd look into adding them into your IR and render from there (in addition to WARC'ing them)

Francis, I'm confused when you say in addition to WARC'ing them. Wouldn't you be putting the WARC into the IR and using it to render? Or are you advocating that a format other than WARC should go into the IR?

I initially meant the latter, but now that you have questioned my thinking, I've revised it ;-)

./fxk

--
Cheap things are of no value, valuable things are not cheap.
Re: [CODE4LIB] archiving web pages
Thanks for the thoughtful responses. We've been actively digitizing our print paper (which ceased publication in 2011) and I was thinking of this as an extension of that effort. Right now, I think capturing a monthly WARC file of the site is definitely a good idea no matter what. But beyond that, as Kyle pointed out, it's not really the web site I'm after but the content. I'd like to present this content alongside print issues in our IR (currently ContentDM). In one sense, I can see doing a weekly capture of the site which would equate to an issue in the old format. But, I could also do a PDF of the content. A PDF makes sense to me in the context of a collection that is largely print-based and gets at what I want (keyword searchable content, authors, dates), but is it disingenuous to fundamentally alter the format? Plus there's the work involved... This may be a question for archivists, but I'm not one so would appreciate any additional thoughts from this group. On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn
Re: [CODE4LIB] archiving web pages
As an archivist, I don't see any problem with using a PDF. Technically it should be a PDF/A, but realistically it is usually a plain PDF. I have done projects where I used PDFs for the archiving of full websites. It can be quite handy, depending on needs of course. Sometimes it works with the look and feel/design, and sometimes it doesn't. Content is usually pretty good, in my experience.

Do a test and see whether your site crashes your Adobe product... sometimes the code, special effects or just size can crash it without a PDF being made. Also look at the number of levels you want captured; that can cause a mess too.

Cheers
Lisa

--
Lisa Snider
Electronic Records Archivist
Harry Ransom Center
The University of Texas at Austin
P.O. Box 7219
Austin, Texas 78713-7219
P: 512-232-4616
www.hrc.utexas.edu

On Tue, Jan 14, 2014 at 12:48 PM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote:

Thanks for the thoughtful responses. We've been actively digitizing our print paper (which ceased publication in 2011) and I was thinking of this as an extension of that effort. Right now, I think capturing a monthly WARC file of the site is definitely a good idea no matter what. But beyond that, as Kyle pointed out, it's not really the web site I'm after but the content. I'd like to present this content alongside print issues in our IR (currently ContentDM). In one sense, I can see doing a weekly capture of the site which would equate to an issue in the old format. But, I could also do a PDF of the content. A PDF makes sense to me in the context of a collection that is largely print-based and gets at what I want (keyword searchable content, authors, dates), but is it disingenuous to fundamentally alter the format? Plus there's the work involved... This may be a question for archivists, but I'm not one so would appreciate any additional thoughts from this group.
On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) kfred...@skidmore.edu wrote: Hi, I'm trying to develop a strategy for preserving issues our school's online newspaper. Creating a WARC file of the content seems straightforward, but how will that content fair long-term? Also, how is the WARC served to an end-user? Is there some other method I should look at? Thanks in advance for any advice! Kathryn