[ 
https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815013#comment-17815013
 ] 

Tim Allison commented on TIKA-4188:
-----------------------------------

Oh, nice. Thank you!

> Add support for ARC files
> -------------------------
>
>                 Key: TIKA-4188
>                 URL: https://issues.apache.org/jira/browse/TIKA-4188
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Gregory Lepore
>            Priority: Minor
>
> The original version of the Internet Archive's storage format is the ARC 
> format (later superseded by WARC and WACZ). 
> The ARC (Archive) format is a file format used for storing web archives. It 
> was developed by the Internet Archive to facilitate the mass storage of web 
> pages, capturing the content as it appeared on the Internet at specific 
> points in time. An ARC file is a single, large file that contains a sequence 
> of archived web resources. Each entry in an ARC file includes the URL of the 
> resource, the date it was captured, the HTTP response headers, and the 
> content of the resource itself (such as HTML pages, images, and other media 
> types).
> The structure of an ARC file generally consists of a file header followed by 
> a series of records, each representing an individual web resource. The ARC 
> file can be gzipped using a two-step process where each record in the ARC 
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have 
> support for the ARC format (https://github.com/iipc/jwarc)
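
For illustration, the record layout described in the issue can be sketched in a few lines of Python. This is a minimal sketch rather than Tika's or jwarc's implementation. It assumes the version 1 URL-record header from the linked specification (URL, IP address, archive date, content type and length, separated by spaces), a 14-digit YYYYMMDDhhmmss date, and a blank line between records. The names ArcRecord, parse_header, open_arc and iter_arc_records are hypothetical. For gzipped input it assumes the compressed records are concatenated into one multi-member gzip stream, which Python's gzip module reads as a single continuous stream.

```python
import gzip
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecord:
    """One archived resource (hypothetical container for this sketch)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    payload: bytes  # HTTP response headers plus the resource body


def parse_header(line: bytes):
    """Parse an assumed v1 header: URL IP date content-type length."""
    text = line.decode("latin-1").rstrip("\r\n")
    # Split from the right so a URL containing spaces stays in one piece.
    url, ip, date, ctype, length = text.rsplit(" ", 4)
    return url, ip, datetime.strptime(date, "%Y%m%d%H%M%S"), ctype, int(length)


def open_arc(data: bytes):
    """Return a readable stream, decompressing if the gzip magic is present."""
    if data[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(data))
    return io.BytesIO(data)


def iter_arc_records(stream):
    """Yield records from an uncompressed (or already decompressed) stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator line between records
        url, ip, date, ctype, length = parse_header(line)
        # The declared length covers everything up to the next record.
        yield ArcRecord(url, ip, date, ctype, stream.read(length))
```

Under these assumptions, the leading filedesc:// entry comes back as the first record, because its header line has the same five fields. A caller would read the file's bytes, pass them to open_arc, and loop over iter_arc_records on the result.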



--
This message was sent by Atlassian Jira
(v8.20.10#820010)