Tim Allison created TIKA-4059:
---------------------------------
Summary: Consider parsing common gzipped formats like we do with package files
Key: TIKA-4059
URL: https://issues.apache.org/jira/browse/TIKA-4059
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
For docx and other zip-based formats, we have a zip detector, and we parse those
container files as a single file. A handful of file formats are often
gzipped: tgz, svgz and warc files.
Currently, users get the content of these files as an attachment to the main
gzipped file with /rmeta or the -J option in tika-app.
This issue proposes adding a simple gzip container detector so that these file
formats are treated as a single file.
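To make the proposal concrete, here is a rough sketch of what such a gzip container check could look like. It uses only the JDK and is not Tika's Detector API or a proposed patch. The class name GzipContainerSniffer, the helper gzip(), and the returned labels ("tgz", "svgz", "warc.gz", "gzip", "not-gzip") are all hypothetical, and the heuristics (a "ustar" marker at offset 257, a leading "WARC/", a "<svg" tag in the first 512 bytes) are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch only: not Tika's Detector interface and not the TIKA-4059 implementation.
final class GzipContainerSniffer {

    // Returns an illustrative label for the gzipped payload; the labels are made up for this sketch.
    public static String sniff(byte[] data) {
        // Every gzip stream starts with the magic bytes 0x1f 0x8b.
        if (data.length < 2 || (data[0] & 0xff) != 0x1f || (data[1] & 0xff) != 0x8b) {
            return "not-gzip";
        }
        byte[] head;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            // Decompress only the first 512 bytes (one tar header block).
            head = in.readNBytes(512);
        } catch (IOException e) {
            // Corrupt or truncated stream: fall back to plain gzip.
            return "gzip";
        }
        // POSIX/GNU tar headers carry "ustar" at offset 257.
        if (head.length >= 262
                && "ustar".equals(new String(head, 257, 5, StandardCharsets.US_ASCII))) {
            return "tgz";
        }
        String text = new String(head, StandardCharsets.ISO_8859_1);
        // WARC records begin with a version line such as "WARC/1.0".
        if (text.startsWith("WARC/")) {
            return "warc.gz";
        }
        // Crude SVG check: look for an <svg tag near the start of the document.
        if (text.contains("<svg")) {
            return "svgz";
        }
        return "gzip";
    }

    // Helper for building sample input: gzips a byte array in memory.
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] warc = gzip("WARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(sniff(warc)); // prints "warc.gz"
    }
}
```

A real detector would presumably work on a marked stream rather than a byte array and would report proper media types, but the idea is the same: peek at the start of the decompressed stream and decide whether the gzip wrapper should be treated as part of a single file.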