Gregory Lepore created TIKA-4188:
------------------------------------

             Summary: Add support for ARC files
                 Key: TIKA-4188
                 URL: https://issues.apache.org/jira/browse/TIKA-4188
             Project: Tika
          Issue Type: Improvement
            Reporter: Gregory Lepore


ARC is the Internet Archive's original storage format for web captures 
(later superseded by WARC and WACZ). 

The ARC (Archive) format is a file format used for storing web archives. It was 
developed by the Internet Archive to facilitate the mass storage of web pages, 
capturing the content as it appeared on the Internet at specific points in 
time. An ARC file is a single, large file that contains a sequence of archived 
web resources. Each entry in an ARC file includes the URL of the resource, the 
date it was captured, the HTTP response headers, and the content of the 
resource itself (such as HTML pages, images, and other media types).
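
As a rough illustration of the per-record metadata described above, the sketch 
below parses a single record header line. It assumes the version 1 layout from 
the ARC specification: URL, IP address, 14-digit capture date, content type, 
and record length, separated by single spaces. The names ArcRecordHeader and 
parse_arc_header are hypothetical and are not part of Tika or jwarc.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size in bytes of the data that follows the header line


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one space-separated ARC v1 record header line (assumed layout)."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # Capture dates are assumed to be YYYYMMDDhhmmss.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


# Made-up header line for illustration only.
header = parse_arc_header(
    "http://example.com/ 192.0.2.1 19970101120000 text/html 2048"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

The length field is what would let a reader skip from one record to the next 
without inspecting the HTTP response headers or the content itself.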

The structure of an ARC file generally consists of a file header followed by a 
series of records, each representing an individual web resource. An ARC file 
can also be gzip-compressed: each record is gzipped on its own and the 
resulting gzip members are concatenated, so the file as a whole is still a 
valid gzip stream.
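
A minimal sketch of per-record compression, assuming each record is stored as 
its own gzip member and the members are simply concatenated. The record 
contents here are made up; only Python's standard gzip module is used.

```python
import gzip

# Two made-up payloads standing in for real ARC records.
records = [b"record one\n", b"record two\n"]

# Compress each record on its own, then concatenate the gzip members.
members = [gzip.compress(r) for r in records]
arc_gz = b"".join(members)

# gzip.decompress reads every member in turn, so the concatenation
# decompresses back to the records in their original order.
restored = gzip.decompress(arc_gz)

# Because each member is self-contained, a record can also be read
# on its own when its byte offset in the file is known.
second = gzip.decompress(arc_gz[len(members[0]):])
```

With this layout, a parser that knows a record's offset can decompress just 
that record instead of the whole file.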

The original ARC format specification is here:
[https://archive.org/web/researcher/ArcFileFormat.php]

Tika currently supports the WARC format via jwarc, which also appears to 
support the ARC format (https://github.com/iipc/jwarc).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
