This is an automated email from the ASF dual-hosted git repository.
tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 46b002d97e TIKA-4645 - part 2, general updates for alpha release -- update docs
46b002d97e is described below
commit 46b002d97ef33740c6700f15560fe30ec1fb124e
Author: tallison <[email protected]>
AuthorDate: Mon Feb 2 14:22:04 2026 -0500
TIKA-4645 - part 2, general updates for alpha release -- update docs
---
docs/modules/ROOT/pages/pipes/unpack-config.adoc | 71 ++++++++++++++++++++++
docs/modules/ROOT/pages/roadmap.adoc | 6 +-
docs/modules/ROOT/pages/security.adoc | 47 ++++++++++++++
.../ROOT/pages/using-tika/server/index.adoc | 50 +++++++++++++++
.../tika/parser/external/ExternalParser.java | 4 ++
5 files changed, 175 insertions(+), 3 deletions(-)
diff --git a/docs/modules/ROOT/pages/pipes/unpack-config.adoc b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
index ce9ddd159d..45f15cdb1c 100644
--- a/docs/modules/ROOT/pages/pipes/unpack-config.adoc
+++ b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
@@ -192,6 +192,77 @@ When the limit is reached:
Set `maxUnpackBytes=-1` to disable the limit. This is not recommended for
untrusted input.
+== Frictionless Data Package Output
+
+The UNPACK mode can output files in https://frictionlessdata.io/[Frictionless Data Package] format,
+a standard for packaging data files with their metadata. This format includes a `datapackage.json`
+manifest with file checksums and MIME types, making it easy to verify and process extracted files.
+
+=== Enabling Frictionless Output
+
+Set `outputFormat` to `FRICTIONLESS` in your UnpackConfig:
+
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "UNPACK",
+    "unpack-config": {
+      "outputFormat": "FRICTIONLESS",
+      "includeFullMetadata": true
+    }
+  }
+}
+----
+
+=== Output Structure
+
+When using Frictionless output format, the ZIP archive contains:
+
+[source]
+----
+output.zip
+├── datapackage.json    # Manifest with file list, SHA256 hashes, mimetypes
+├── metadata.json       # Full RMETA metadata (if includeFullMetadata=true)
+└── unpacked/
+    ├── 00000001.pdf
+    ├── 00000002.png
+    └── ...
+----
+
+The `datapackage.json` file contains:
+
+* List of all extracted files as "resources"
+* SHA256 hash for each file
+* MIME type for each file
+* File size in bytes
+
+=== Frictionless Configuration Options
+
+[cols="2,1,2,3"]
+|===
+|Property |Type |Default |Description
+
+|`outputFormat`
+|STANDARD, FRICTIONLESS
+|`STANDARD`
+|Output format for the ZIP archive. Use `FRICTIONLESS` for Data Package format.
+
+|`includeFullMetadata`
+|boolean
+|`false`
+|Include a `metadata.json` file with full RMETA-style metadata for all extracted files.
+|===
+
+=== CLI Usage
+
+Extract files in Frictionless format using the CLI:
+
+[source,bash]
+----
+java -jar tika-app.jar --unpack --unpack-format=FRICTIONLESS -i input.docx -o output/
+----
+
== Code Examples
For working code examples, see:
diff --git a/docs/modules/ROOT/pages/roadmap.adoc b/docs/modules/ROOT/pages/roadmap.adoc
index 3e28829a43..f58fa4905e 100644
--- a/docs/modules/ROOT/pages/roadmap.adoc
+++ b/docs/modules/ROOT/pages/roadmap.adoc
@@ -37,10 +37,10 @@ with traditional calendars.
|April 2025
|End support for 2.x (and Java 8)
-|January 2026
+|March 2026
|Release 4.0.0
-|June 2026
+|March 2027
|End support for 3.x (and Java 11)
|===
@@ -65,7 +65,7 @@ with traditional calendars.
|4.x
|17
|jakarta
-|January 2026
+|March 2026
|TBD
|5.x
diff --git a/docs/modules/ROOT/pages/security.adoc b/docs/modules/ROOT/pages/security.adoc
index ddc09b7215..0357714508 100644
--- a/docs/modules/ROOT/pages/security.adoc
+++ b/docs/modules/ROOT/pages/security.adoc
@@ -32,3 +32,50 @@ For information about known security vulnerabilities (CVEs) in Apache Tika and t
remediation, please see:
* https://tika.apache.org/security.html[Apache Tika Security Vulnerabilities]
+
+== External Command Security
+
+Apache Tika can be configured to use external system commands for certain operations,
+such as the `FileCommandDetector` and `ExternalParser` components.
+
+CAUTION: External command configuration should only be performed by trusted administrators.
+Never allow untrusted users to configure command paths or arguments.
+
+=== Security Best Practices
+
+1. **Restrict configuration access**: Only allow administrators to modify Tika configuration
+ files that specify external commands.
+
+2. **Use absolute paths**: Always configure external commands with absolute paths to prevent
+ PATH manipulation attacks.
+
+3. **Sandbox execution**: Consider running Tika in a container or sandbox environment to
+ limit the impact of any command execution vulnerabilities.
+
+4. **Audit command configuration**: Regularly review configured external commands and
+ their arguments.
+
+=== Affected Components
+
+* `FileCommandDetector`: Uses the system `file` command for MIME type detection
+* `ExternalParser`: Executes arbitrary external programs to extract content
+* `ExternalEmbedder`: Uses external tools to embed content
+
+== Credential Handling
+
+=== Password Storage in Memory
+
+Tika stores some credentials as Java String objects, which remain in memory until
+garbage collected. For environments with strict security requirements:
+
+1. **Use environment variables**: Configure credentials via environment variables
+ rather than configuration files where possible.
+
+2. **Use secret managers**: Integrate with HashiCorp Vault, AWS Secrets Manager,
+ or similar services for production deployments.
+
+3. **Enable encryption**: Use the AES encryption option in `HttpClientFactory`
+ for stored passwords.
+
+4. **Minimize credential scope**: Use credentials with minimum necessary privileges
+ and rotate them regularly.
diff --git a/docs/modules/ROOT/pages/using-tika/server/index.adoc b/docs/modules/ROOT/pages/using-tika/server/index.adoc
index 59017b943b..dbb086b5e6 100644
--- a/docs/modules/ROOT/pages/using-tika/server/index.adoc
+++ b/docs/modules/ROOT/pages/using-tika/server/index.adoc
@@ -36,3 +36,53 @@ The server starts on port 9998 by default.
== Topics
* xref:using-tika/server/tls.adoc[TLS/SSL Configuration] - Secure your server with TLS and mutual authentication
+
+== Security Configuration
+
+=== Config Endpoint Protection
+
+By default, the `/config` endpoints that expose server configuration are disabled for security
+reasons. These endpoints can reveal sensitive information about your server configuration,
+including parser settings and system properties (see https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271[CVE-2015-3271]).
+
+The protected endpoints include:
+
+* `/config` - Returns the server's full configuration
+* `/config/parsers` - Returns configured parsers
+* `/config/detectors` - Returns configured detectors
+* `/config/mimeTypes` - Returns MIME type mappings
+
+=== Enabling Config Endpoints
+
+To enable these endpoints:
+
+[source,json]
+----
+{
+  "server": {
+    "enableUnsecureFeatures": true
+  }
+}
+----
+
+WARNING: Only enable `enableUnsecureFeatures` if you have secured access to Tika Server through
+network controls (firewalls, private subnets), a reverse proxy (nginx, Apache httpd), or
+xref:using-tika/server/tls.adoc[2-way TLS authentication]. Exposing config endpoints to
+untrusted networks can help attackers identify vulnerabilities and craft targeted attacks.
+
+=== Command Line Usage
+
+You can also enable unsecure features via command line:
+
+[source,bash]
+----
+java -jar tika-server-standard.jar --enableUnsecureFeatures
+----
+
+=== Security Best Practices
+
+1. **Keep config endpoints disabled** in production (default behavior)
+2. **Use network controls** to restrict access to the Tika Server (firewall rules, private subnets)
+3. **Consider TLS** for encrypted communication - see xref:using-tika/server/tls.adoc[TLS Configuration]
+4. **Run with minimal privileges** - don't run Tika Server as root
+5. **Monitor logs** for unusual access patterns
diff --git a/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java b/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
index 38d4f61563..0e17384928 100644
--- a/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
+++ b/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
@@ -55,7 +55,11 @@ import org.apache.tika.sax.XHTMLContentHandler;
/**
* Parser that uses an external program (like catdoc or pdf2txt) to extract
* text content and metadata from a given document.
+ *
+ * @deprecated Use {@link org.apache.tika.parser.external2.ExternalParser} instead.
+ * This class will be removed in a future version of Tika.
*/
+@Deprecated
public class ExternalParser implements Parser {
private static final Logger LOG =
LoggerFactory.getLogger(ExternalParser.class);