[ 
https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815012#comment-17815012
 ] 

Gregory Lepore commented on TIKA-4188:
--------------------------------------

The ones I'm working with are concatenated gzip streams inside an outer gzip,
which does allow for random access to offsets. I'm not familiar with the three
records type; it doesn't appear in any of the files I've got. I also thought
you could just pass the file to jwarc.
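
As a rough illustration of why per-record gzip members allow random access,
here is a minimal Python sketch using only the standard library. The record
bytes and the read_member helper are hypothetical stand-ins, not real ARC
records or part of warcio or jwarc; the sketch only assumes that each record is
compressed as its own gzip member and that the start offset of each member is
known (for example from an external index).

```python
import gzip
import zlib

# Hypothetical stand-ins for ARC records; real records carry a URL line,
# a capture date, HTTP headers, and the payload.
records = [b"record one\n", b"record two\n", b"record three\n"]

# Compress each record as its own gzip member and note where each one starts.
offsets = []
blob = b""
for rec in records:
    offsets.append(len(blob))
    blob += gzip.compress(rec)

# Second step: wrap the concatenated members in one outer gzip layer.
outer = gzip.compress(blob)


def read_member(data: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; any bytes that follow
    # are left in d.unused_data rather than being decoded.
    return d.decompress(data[offset:])


# Remove the outer layer once, then seek straight to the second record.
inner = gzip.decompress(outer)
second = read_member(inner, offsets[1])
```

In this sketch the offsets refer to the inner stream, so the outer layer has to
be removed before they are useful. Reading the inner stream sequentially with
gzip.decompress returns all members concatenated.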

I'm using warcio to parse the files in Python, and it's working well. The
project has several ARC sample files, including the gzipped type, at:
[https://github.com/webrecorder/warcio/tree/master/test/data]

I can also share some sample files with you offline.

> Add support for ARC files
> -------------------------
>
>                 Key: TIKA-4188
>                 URL: https://issues.apache.org/jira/browse/TIKA-4188
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Gregory Lepore
>            Priority: Minor
>
> The original version of the Internet Archive's storage format is the ARC 
> format (later superseded by WARC and WACZ). 
> The ARC (Archive) format is a file format used for storing web archives. It 
> was developed by the Internet Archive to facilitate the mass storage of web 
> pages, capturing the content as it appeared on the Internet at specific 
> points in time. An ARC file is a single, large file that contains a sequence 
> of archived web resources. Each entry in an ARC file includes the URL of the 
> resource, the date it was captured, the HTTP response headers, and the 
> content of the resource itself (such as HTML pages, images, and other media 
> types).
> The structure of an ARC file generally consists of a file header followed by 
> a series of records, each representing an individual web resource. The ARC 
> file can be gzipped using a two-step process where each record in the ARC 
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have 
> support for the ARC format (https://github.com/iipc/jwarc)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
