[jira] [Commented] (TIKA-4188) Add support for ARC files

Tim Allison (Jira) Tue, 06 Feb 2024 13:06:03 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815006#comment-17815006
 ]


Tim Allison commented on TIKA-4188:
-----------------------------------

[~g...@rhobard.com], help! I thought this would be a simple matter of adding 
the ARC mime type to the WarcParser's supported set, and we'd be good to go, 
but that is just not working.

I looked more closely at ArcTest, and it reads three records for an ARC file: a 
WarcInfo, one dns WarcResponse and then another, actual WarcResponse from the 
target site. Do you have any idea if this is "normal"? Can we rely on this 
three-records-per-ARC pattern?

Then, I read somewhere that you can gzip ARCs just like you can gzip WARCs: 
you concatenate multiple gzip streams into one big gzip file, and random access 
stays available because the offsets are stable. Any chance you could dummy up an 
example of that or find a publicly available version?
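
For reference, here is a minimal sketch in Python of the concatenated-gzip 
layout described above. It is an illustration only: the byte strings are 
placeholders and not valid ARC records, and the helper names 
write_multi_member and read_member are hypothetical, not part of Tika or jwarc.

```python
import gzip
import zlib


def write_multi_member(records):
    """Gzip each record separately and concatenate the members.

    Returns the combined bytes and the start offset of each member.
    """
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))      # offset where this member starts
        blob += gzip.compress(record)  # one complete gzip stream per record
    return bytes(blob), offsets


def read_member(blob, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    # Decompression stops at the end of the first member; any bytes after
    # it are left in decompressor.unused_data.
    return decompressor.decompress(blob[offset:])


if __name__ == "__main__":
    # Placeholder payloads, NOT real ARC records.
    records = [b"record one\n", b"record two\n", b"record three\n"]
    blob, offsets = write_multi_member(records)

    # Read as a whole, the file decompresses to all records in order.
    assert gzip.decompress(blob) == b"".join(records)

    # Seeking to a stored offset yields just that record.
    print(read_member(blob, offsets[1]))
```

Since every member is a complete gzip stream, a reader that has recorded the 
offsets can jump straight to one record without decompressing the ones before 
it.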

> Add support for ARC files
> -------------------------
>
>                 Key: TIKA-4188
>                 URL: https://issues.apache.org/jira/browse/TIKA-4188
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Gregory Lepore
>            Priority: Minor
>
> The original version of the Internet Archive's storage format is the ARC 
> format (later superseded by WARC and WACZ). 
> The ARC (Archive) format is a file format used for storing web archives. It 
> was developed by the Internet Archive to facilitate the mass storage of web 
> pages, capturing the content as it appeared on the Internet at specific 
> points in time. An ARC file is a single, large file that contains a sequence 
> of archived web resources. Each entry in an ARC file includes the URL of the 
> resource, the date it was captured, the HTTP response headers, and the 
> content of the resource itself (such as HTML pages, images, and other media 
> types).
> The structure of an ARC file generally consists of a file header followed by 
> a series of records, each representing an individual web resource. The ARC 
> file can be gzipped using a two step process where each record in the ARC 
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have 
> support for the ARC format (https://github.com/iipc/jwarc)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
