[ https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816165#comment-17816165 ]
Hudson commented on TIKA-4188:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1502 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1502/])
TIKA-4188 (#1587) (github: [https://github.com/apache/tika/commit/7d48d00ac1febfb1ac70e4887268b28fb4951b78])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/detect/gzip/GZipSpecializationDetector.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/resources/test-documents/testARC.arc
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/resources/test-documents/example.arc.gz
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/main/java/org/apache/tika/parser/warc/WARCParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/java/org/apache/tika/parser/warc/WARCParserTest.java


> Add support for ARC files
> -------------------------
>
>                 Key: TIKA-4188
>                 URL: https://issues.apache.org/jira/browse/TIKA-4188
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Gregory Lepore
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> The original version of the Internet Archive's storage format is the ARC
> format (later superseded by WARC and WACZ).
> The ARC (Archive) format is a file format used for storing web archives. It
> was developed by the Internet Archive to facilitate the mass storage of web
> pages, capturing the content as it appeared on the Internet at specific
> points in time. An ARC file is a single, large file that contains a sequence
> of archived web resources.
> Each entry in an ARC file includes the URL of the
> resource, the date it was captured, the HTTP response headers, and the
> content of the resource itself (such as HTML pages, images, and other media
> types).
> The structure of an ARC file generally consists of a file header followed by
> a series of records, each representing an individual web resource. The ARC
> file can be gzipped using a two step process where each record in the ARC
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have
> support for the ARC format (https://github.com/iipc/jwarc)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
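
For readers unfamiliar with the record layout described in the issue, here is a minimal sketch of reading one record header. It assumes the ARC version 1 URL-record line (URL, IP address, archive date, content type and body length, separated by single spaces), which should be checked against the specification linked above. The class ArcRecordHeader and its parse method are hypothetical names used only for illustration. They are not part of Tika or jwarc and do not reflect how WARCParser handles ARC input.

```java
/**
 * Hypothetical holder for the fields of one ARC v1 URL-record header line.
 * Assumed layout: "<url> <ip-address> <archive-date> <content-type> <length>".
 */
final class ArcRecordHeader {
    final String url;          // address of the archived resource
    final String ipAddress;    // host the resource was fetched from
    final String archiveDate;  // capture timestamp, kept as the raw string
    final String contentType;  // media type reported for the resource
    final long length;         // size in bytes of the body that follows the header

    ArcRecordHeader(String url, String ipAddress, String archiveDate,
                    String contentType, long length) {
        this.url = url;
        this.ipAddress = ipAddress;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    /** Splits a single header line into its five space-separated fields. */
    static ArcRecordHeader parse(String line) {
        String[] parts = line.trim().split(" ");
        if (parts.length != 5) {
            throw new IllegalArgumentException(
                    "expected 5 fields but found " + parts.length);
        }
        return new ArcRecordHeader(parts[0], parts[1], parts[2], parts[3],
                Long.parseLong(parts[4]));
    }

    public static void main(String[] args) {
        // Made-up sample line; real headers come from an .arc file.
        ArcRecordHeader h = parse(
                "http://example.com/ 192.0.2.1 20240208120000 text/html 1234");
        System.out.println(h.url + " captured at " + h.archiveDate
                + ", " + h.length + " bytes");
    }
}
```

A reader built on this would consume the next `length` bytes as the record body (HTTP response headers plus content) before looking for the following header line. Malformed headers, which do occur in real-world archives, would need more tolerant handling than this sketch provides.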