WARC File Format Published as an International Standard
  
The International Internet Preservation Consortium is pleased to 
announce the publication of the WARC file format as an international 
standard: ISO 28500:2009, Information and documentation -- WARC file format.
 
For many years, heritage organizations have tried to find the most 
appropriate ways to collect and keep track of World Wide Web material 
using web-scale tools such as web crawlers. At the same time, these 
organizations were concerned with the requirement to archive very large 
numbers of born-digital and digitized files. What was needed was a container 
format that lets a single file simply and safely carry a very large 
number of constituent data objects (of unrestricted type, including many 
binary types) for the purpose of storage, management, and exchange. 
Another requirement was that the container need only minimal knowledge 
of the nature of the objects.
 
The WARC format is expected to be a standard way to structure, manage 
and store billions of resources collected from the web and elsewhere. It 
is an extension of the ARC format, which has been used since 1996 to 
store files harvested on the web. The WARC format offers new possibilities, 
notably the recording of HTTP request headers, the recording of 
arbitrary metadata, the allocation of an identifier for every contained 
file, the management of duplicates and of migrated records, and the 
segmentation of the records. WARC files are intended to store every type 
of digital content, whether retrieved by HTTP or by another protocol.
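
As a rough illustration of the container idea described above, the 
following Python sketch serializes arbitrary byte payloads as records 
that each carry their own identifier, date and length, then concatenates 
them into a single stream. The function names (build_warc_record, 
build_container) are hypothetical, and the header layout follows the 
WARC/1.0 named-field convention only as an assumption for illustration; 
it is not a validated implementation, and ISO 28500 remains the 
authoritative definition.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(payload: bytes, target_uri: str,
                      record_type: str = "resource",
                      content_type: str = "application/octet-stream") -> bytes:
    """Serialize one payload as a WARC/1.0-style record (illustrative sketch)."""
    # Every record gets its own identifier, here a random UUID URN.
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    # Record creation time in UTC.
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        # The container only needs the byte length, not the payload's nature.
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The header block and payload are followed by two CRLFs.
    return head + payload + b"\r\n\r\n"


def build_container(items) -> bytes:
    """Concatenate records for (uri, payload) pairs into one byte stream."""
    return b"".join(build_warc_record(data, uri) for uri, data in items)


# Hypothetical example data: one text object and one binary object.
container = build_container([
    ("http://example.org/page.html", b"<html>hello</html>"),
    ("http://example.org/image.bin", b"\x00\x01\x02\x03"),
])
```

Because each record declares its own length, a reader can step from one 
record to the next without interpreting the payloads, which is what lets 
one file carry many objects of unrestricted type.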
 
The motivation to extend the ARC format arose from the discussion and 
experiences of the International Internet Preservation Consortium, whose 
core mission is to acquire, preserve and make accessible knowledge and 
information from the Internet for future generations. The IIPC Standards 
Working Group submitted a draft of the WARC file format to ISO 
TC46/SC4/WG12. The draft was accepted as a new Work Item by ISO in 
May 2005.
 
Over a period of four years, the ISO working group, with the 
Bibliothèque nationale de France as convener, collaborated closely with 
IIPC experts to improve the original draft. WG12 will continue to 
maintain the standard and prepare its future revisions.
 
Standardization offers a guarantee of durability and evolution for the 
WARC format. It will help web archiving enter the mainstream 
activities of heritage institutions and other sectors by fostering the 
development of new tools and ensuring the interoperability of 
collections. Several applications are already WARC-compliant, such as 
the Heritrix crawler for harvesting, the WARC tools for data management 
and exchange, and the Wayback Machine, NutchWAX and other search tools 
for access. The international recognition of the WARC format and its 
applicability to every kind of digital object will provide strong 
incentives to use it within and beyond the web archiving community.
 
General information about the IIPC can be found at: http://netpreserve.org
 