(tika) 03/07: TIKA-4637 -- update docs

tallison Sat, 31 Jan 2026 16:39:44 -0800

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4637
in repository https://gitbox.apache.org/repos/asf/tika.git


commit 8c52fea53577945d9e9e71a9d003529a174e04f9
Author: tallison <[email protected]>
AuthorDate: Sat Jan 31 18:09:36 2026 -0500

    TIKA-4637 -- update docs
---
 docs/modules/ROOT/nav.adoc                       |   1 +
 docs/modules/ROOT/pages/pipes/index.adoc         |   2 +
 docs/modules/ROOT/pages/pipes/unpack-config.adoc | 202 +++++++++++++++++++++++
 3 files changed, 205 insertions(+)

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index ef0165c69b..1bca9526b1 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -20,6 +20,7 @@
 ** xref:using-tika/cli/index.adoc[Command Line]
 ** xref:using-tika/grpc/index.adoc[gRPC]
 * xref:pipes/index.adoc[Pipes]
+** xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]
 * xref:configuration/index.adoc[Configuration]
 ** xref:configuration/parsers/pdf-parser.adoc[PDF Parser]
 ** xref:configuration/parsers/tesseract-ocr-parser.adoc[Tesseract OCR]
diff --git a/docs/modules/ROOT/pages/pipes/index.adoc b/docs/modules/ROOT/pages/pipes/index.adoc
index e7b49ebc3c..588ae7d9ab 100644
--- a/docs/modules/ROOT/pages/pipes/index.adoc
+++ b/docs/modules/ROOT/pages/pipes/index.adoc
@@ -29,6 +29,8 @@ Tika Pipes provides a framework for processing large volumes of documents with:
 
 == Topics
 
+* xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] - Extract raw bytes from embedded documents using `ParseMode.UNPACK`
+
 // Add links to specific topics as they are created
 // * link:getting-started.html[Getting Started]
 // * link:fetchers.html[Fetchers]
diff --git a/docs/modules/ROOT/pages/pipes/unpack-config.adoc b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
new file mode 100644
index 0000000000..ce9ddd159d
--- /dev/null
+++ b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
@@ -0,0 +1,202 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+
+= UnpackConfig: Extracting Embedded Document Bytes
+
+When processing container files (ZIP, DOCX, PDF with attachments, etc.), you may want to
+extract the raw bytes of embedded documents in addition to parsing them. `UnpackConfig`
+controls how embedded bytes are extracted and emitted.
+
+== Quick Start
+
+Use `ParseMode.UNPACK` to automatically extract embedded document bytes:
+
+[source,json]
+----
+{
+  "id": "doc1",
+  "fetchKey": {"fetcherId": "fsf", "fetchKey": "container.docx"},
+  "emitKey": {"emitterId": "fse", "emitKey": "container.docx"},
+  "parseContext": {
+    "parseMode": "UNPACK"
+  }
+}
+----
+
+This extracts both metadata (like `RMETA` mode) and embedded document bytes.
+
+== Configuration Options
+
+[cols="2,1,2,3"]
+|===
+|Property |Type |Default |Description
+
+|`emitter`
+|String
+|_(from FetchEmitTuple)_
+|Emitter name for embedded bytes. Falls back to the FetchEmitTuple's emitterId.
+
+|`maxUnpackBytes`
+|long
+|10GB
+|Maximum total bytes to extract per file. Set to `-1` for unlimited (not recommended).
+
+|`includeOriginal`
+|boolean
+|`false`
+|Include the container document itself in the output.
+
+|`zipEmbeddedFiles`
+|boolean
+|`false`
+|Collect all embedded files into a single ZIP archive.
+
+|`includeMetadataInZip`
+|boolean
+|`false`
+|Include `.metadata.json` files for each embedded document in the ZIP.
+
+|`zeroPadName`
+|int
+|`0`
+|Zero-pad embedded IDs in output names (e.g., `8` produces `00000001`).
+
+|`suffixStrategy`
+|NONE, EXISTING, DETECTED
+|`NONE`
+|How to determine file extensions for extracted files.
+
+|`embeddedIdPrefix`
+|String
+|`"-"`
+|Prefix between base name and embedded ID (e.g., `doc-1.txt`).
+
+|`keyBaseStrategy`
+|DEFAULT, CUSTOM
+|`DEFAULT`
+|Strategy for generating emit keys.
+
+|`emitKeyBase`
+|String
+|`""`
+|Custom base path when `keyBaseStrategy=CUSTOM`.
+|===
+
+== Examples
+
+=== Basic Byte Extraction
+
+Extract embedded bytes with default naming:
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "UNPACK"
+  }
+}
+----
+
+=== ZIP Output with Metadata
+
+Collect all embedded files into a ZIP with metadata:
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "UNPACK",
+    "unpack-config": {
+      "zipEmbeddedFiles": true,
+      "includeMetadataInZip": true,
+      "includeOriginal": true
+    }
+  }
+}
+----
+
+=== Custom Naming
+
+Control output file naming:
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "UNPACK",
+    "unpack-config": {
+      "zeroPadName": 8,
+      "suffixStrategy": "DETECTED",
+      "embeddedIdPrefix": "-embed-"
+    }
+  }
+}
+----
+
+Produces names like: `document-embed-00000001.pdf`
+
+=== Limit Extraction Size
+
+Prevent unbounded extraction from malicious files:
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "UNPACK",
+    "unpack-config": {
+      "maxUnpackBytes": 104857600
+    }
+  }
+}
+----
+
+This limits extraction to 100MB total.
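+
+=== Separate Emitter for Embedded Bytes
+
+The `emitter` property listed under Configuration Options should let embedded bytes go to a
+different emitter than the one named in the FetchEmitTuple. A minimal sketch, assuming an
+emitter with the hypothetical id `embedded-fse` is defined elsewhere in your configuration:
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "UNPACK",
+    "unpack-config": {
+      "emitter": "embedded-fse"
+    }
+  }
+}
+----
+
+If `emitter` is omitted, embedded bytes fall back to the FetchEmitTuple's `emitterId`.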
+
+== Suffix Strategies
+
+`NONE`:: No file extension added to extracted files.
+`EXISTING`:: Use the file extension from the embedded document's resource name.
+`DETECTED`:: Use the file extension based on the detected MIME type.
+
+== Key Base Strategies
+
+`DEFAULT`:: Output key is `{containerKey}{embeddedIdPrefix}{id}{suffix}`.
+`CUSTOM`:: Output key uses `emitKeyBase` as the prefix.
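+
+As a sketch of the `CUSTOM` strategy, assuming the hypothetical base path
+`extracted/container` (any value that is valid for your emitter would do):
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "UNPACK",
+    "unpack-config": {
+      "keyBaseStrategy": "CUSTOM",
+      "emitKeyBase": "extracted/container"
+    }
+  }
+}
+----
+
+With this request, embedded keys should start with `extracted/container` rather than the
+container's emit key; the embedded ID and suffix are presumably appended as in the other examples.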
+
+== Safety Limits
+
+The `maxUnpackBytes` setting protects against zip bombs and other malicious files that
+expand to enormous sizes. The default 10GB limit should be appropriate for most use cases.
+
+When the limit is reached:
+
+* Extraction stops for the current file
+* An exception is logged
+* Parsing continues (already-extracted bytes are kept)
+* The parse result status is `PARSE_SUCCESS_WITH_EXCEPTION`
+
+Set `maxUnpackBytes=-1` to disable the limit. This is not recommended for untrusted input.
+
+== Code Examples
+
+For working code examples, see:
+
+* `tika-pipes/tika-pipes-integration-tests/src/test/java/org/apache/tika/pipes/core/UnpackModeTest.java`
+* `tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/TikaPipesTest.java`
+
+These test files demonstrate all configuration options with assertions.
