Archiving the web

Born digital
Oct 21st 2010
From The Economist print edition

National libraries start to preserve the web, but cannot save everything



The library of the future



IN THE digital realm, things seem always to happen the wrong way round. Whereas 
Google has hurried to scan books into its digital catalogue, a group of 
national libraries has begun saving what the online giant leaves behind. For 
although search engines such as Google index the web, they do not archive it. 
Many websites just disappear when their owner runs out of money or interest. 
Adam Farquhar, in charge of digital projects for the British Library, points 
out that the world has in some ways a better record of the beginning of the 
20th century than of the beginning of the 21st.

In 1996 Brewster Kahle, a computer scientist and internet entrepreneur, founded 
the Internet Archive, a non-profit organisation dedicated to preserving 
websites. He also began gently harassing national libraries to worry about 
preserving the web. They started to pay attention when several elections 
produced interesting material that never touched paper.

In 2003 eleven national libraries and the Internet Archive launched a project 
to preserve "born-digital" information: the kind that has never existed in 
anything but digital form. Called the International Internet Preservation 
Consortium (IIPC), it now includes 39 large institutional libraries. But the 
task is impossible. One reason is the sheer amount of data on the web. The 
groups have already collected several petabytes of data (a petabyte can hold 
roughly 10 trillion copies of this article).

Another issue is ensuring that the data is stored in a format that makes it 
available in centuries to come. Ancient manuscripts are still readable. But 
much digital media from the past is readable only on a handful of fragile and 
antique machines, if at all. The IIPC has set a single format, making it more 
likely that future historians will be able to find a machine to read the data. 
But a single solution cannot capture all content. Web publishers increasingly 
serve up content-rich pages based on complex data sets. Audio and video 
programmes based on proprietary formats such as Windows Media are 
another challenge. What happens if Microsoft is bankrupt and forgotten in 2210?

The biggest problem, for now, is money. The British Library estimates that it 
costs half as much to store a digital document as it does a physical one. But 
there are a lot more digital ones. America's Library of Congress enjoys a 
specific mandate, and budget, to save the web. The British Library is still 
seeking one.

So national libraries have decided to split the task. Each has taken 
responsibility for the digital works in its national top-level domain 
(web-address suffixes such as ".uk" or ".fr"). In countries with larger 
domains, such as Britain and America, curators cannot hope to save everything. 
They are concentrating on material of national interest, such as elections, 
news sites and citizen journalism or innovative uses of the web.

The daily death of countless websites has brought a new sense of urgency, and 
forced libraries to adapt culturally as well. Past practice was to tag every 
new document as it arrived. Now precision must be sacrificed to scale and 
speed. The task started before standards, goals or budgets were set. And they 
may yet change. Just like many websites, libraries will be stuck in what is 
known as "permanent beta".





Copyright © 2010 The Economist Newspaper and The Economist Group. All rights 
reserved.


________________________________

José A. López
Globomedia, Dpto. de Documentación-Comunicación




----------------------------------------------------
IWETEL articles are distributed thanks to the support and technical 
collaboration of RedIRIS, the Spanish academic network (http://www.rediris.es)
----------------------------------------------------
