Gregory Lepore created TIKA-4188:
------------------------------------

             Summary: Add support for ARC files
                 Key: TIKA-4188
                 URL: https://issues.apache.org/jira/browse/TIKA-4188
             Project: Tika
          Issue Type: Improvement
            Reporter: Gregory Lepore

ARC (Archive) is the original storage format of the Internet Archive, later superseded by WARC and WACZ. It was developed to facilitate the mass storage of web pages, capturing content as it appeared on the Internet at specific points in time.

An ARC file is a single, large file that contains a sequence of archived web resources. It consists of a file header followed by a series of records, each representing an individual web resource. Each record includes the URL of the resource, the date it was captured, the HTTP response headers, and the content of the resource itself (such as HTML pages, images, and other media types).

ARC files are usually gzip-compressed. Each record is gzipped individually and the compressed records are concatenated, so the resulting .arc.gz file is itself a valid multi-member gzip stream that can be read record by record or decompressed in a single pass.

The original ARC format specification is here:
[https://archive.org/web/researcher/ArcFileFormat.php]

The WARC format is currently supported via jwarc, which also appears to support the ARC format (https://github.com/iipc/jwarc).
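
To make the record layout concrete, here is a minimal sketch in Python of how a parser might walk the records of a decompressed ARC file. It assumes the version-1 record header described in the specification linked above: a single line of five space-separated fields (URL, IP address, 14-digit capture date, content type, body length in bytes), followed by the body and a newline separator. The names `ArcRecordHeader`, `parse_arc_v1_header`, `iter_records`, and `make_record` are hypothetical and exist only for illustration. This is not how jwarc or Tika implement it, and it ignores edge cases such as URLs that contain spaces.

```python
import gzip
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_v1_header(line: str) -> ArcRecordHeader:
    """Parse one header line, assuming the five-field version-1 layout."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 space-separated fields, got {len(fields)}")
    url, ip, date, content_type, length = fields
    return ArcRecordHeader(
        url, ip, datetime.strptime(date, "%Y%m%d%H%M%S"), content_type, int(length)
    )


def iter_records(data: bytes):
    """Yield (header, body) pairs from decompressed ARC bytes."""
    pos = 0
    while pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            break  # no complete header line left
        line = data[pos:nl]
        if not line.strip():
            pos = nl + 1  # blank separator line between records
            continue
        header = parse_arc_v1_header(line.decode("utf-8"))
        body_start = nl + 1
        yield header, data[body_start:body_start + header.length]
        pos = body_start + header.length


def make_record(url: str, ip: str, date: str, content_type: str, body: bytes) -> bytes:
    """Build one illustrative record: header line, body, newline separator."""
    header = f"{url} {ip} {date} {content_type} {len(body)}\n".encode("utf-8")
    return header + body + b"\n"


if __name__ == "__main__":
    page = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
    # Each record is compressed as its own gzip member, then concatenated.
    archive = gzip.compress(
        make_record("http://example.com/", "192.0.2.1", "20240101120000", "text/html", page)
    ) + gzip.compress(
        make_record("http://example.com/a.txt", "192.0.2.1", "20240101120005", "text/plain", b"hello")
    )
    # gzip.decompress reads through all concatenated members.
    for header, body in iter_records(gzip.decompress(archive)):
        print(header.url, header.archive_date.isoformat(), len(body))
```

Because the header carries the body length, a reader can skip over a record without inspecting its content, which is what makes sequential scanning of large archives practical.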