This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new ecbccdd8c8 TIKA-4747 -- improve pdf and ocr/imagemagick docs. Make sure to include default-parser (#2862)
ecbccdd8c8 is described below
commit ecbccdd8c8b11eb57b35b2992aef2abadb7448ec
Author: Tim Allison <[email protected]>
AuthorDate: Wed Jun 3 17:15:57 2026 -0400
TIKA-4747 -- improve pdf and ocr/imagemagick docs. Make sure to include default-parser (#2862)
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
---
docs/modules/ROOT/pages/configuration/index.adoc | 26 +++++++++++
.../pages/configuration/parsers/pdf-parser.adoc | 5 ++-
.../parsers/tesseract-ocr-parser.adoc | 6 ++-
.../pages/migration-to-4x/migrating-to-4x.adoc | 12 +++---
.../config-examples/pdf-parser-basic.json | 4 ++
.../resources/config-examples/pdf-parser-full.json | 50 ++++++++++++++++++++++
.../resources/config-examples/tesseract-basic.json | 4 ++
.../resources/config-examples/tesseract-full.json | 31 +++++++++++++-
.../apache/tika/parser/ocr/ImagePreprocessor.java | 28 +++++++-----
9 files changed, 145 insertions(+), 21 deletions(-)
diff --git a/docs/modules/ROOT/pages/configuration/index.adoc b/docs/modules/ROOT/pages/configuration/index.adoc
index 068fc71c9f..c8cb3ab7e7 100644
--- a/docs/modules/ROOT/pages/configuration/index.adoc
+++ b/docs/modules/ROOT/pages/configuration/index.adoc
@@ -65,6 +65,32 @@ Per-section documentation:
`plugin-roots` — see xref:pipes/configuration.adoc[Pipes Configuration]
and xref:pipes/index.adoc[Tika Pipes].
+== The `parsers` list and `default-parser`
+
+Tika configuration files are JSON, with optional support for `//` and `/* */` comments.
+
+A `parsers` list loads *only* the parsers it names — every other parser is
dropped.
+To customize one parser while keeping all the others, add a `default-parser` entry:
+
+[source,json]
+----
+{
+ "parsers": [
+ { "pdf-parser": { "sortByPosition": true } },
+ { "default-parser": {} }
+ ]
+}
+----
+
+Configuring a parser automatically excludes its default copy, so there is no duplication.
+Omit `default-parser` only when you want a Tika limited to the parsers you listed.
+
+== Windows file paths
+
+JSON uses the backslash as an escape character, so path options (e.g. `tesseractPath`,
+`imageMagickPath`) must use forward slashes (`C:/Tools/...`) or escaped backslashes
+(`C:\\Tools\\...`). A single backslash is a JSON parse error.
+
== Topics
=== Parser Configuration
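
Editor's sketch (not part of the commit): the "Windows file paths" paragraph added above names two ways to write a path option safely in JSON. The small Java helper below illustrates both. The class name, method names, and the sample path are hypothetical and exist only for illustration; nothing here is Tika API, and the printed lines are fragments rather than complete Tika configs.

```java
// Hypothetical helper, for illustration only; not part of Tika.
final class WindowsPathForJson {

    // Option 1: replace backslashes with forward slashes (C:\Tools\x becomes C:/Tools/x).
    static String toForwardSlashes(String windowsPath) {
        return windowsPath.replace('\\', '/');
    }

    // Option 2: double every backslash so it survives JSON string parsing.
    static String escapeBackslashes(String windowsPath) {
        return windowsPath.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        // Java literal for the (made-up) path C:\Tools\Tesseract-OCR
        String raw = "C:\\Tools\\Tesseract-OCR";
        System.out.println("{ \"tesseractPath\": \"" + toForwardSlashes(raw) + "\" }");
        System.out.println("{ \"tesseractPath\": \"" + escapeBackslashes(raw) + "\" }");
    }
}
```

Either form should be read back by a JSON parser as the same directory on Windows; forward slashes are usually the easier of the two to review by eye.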
diff --git a/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc b/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc
index 1509282e1f..63ffa53c7c 100644
--- a/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc
+++ b/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc
@@ -29,8 +29,9 @@ icon:github[] https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers
== Full Configuration
-The following example shows all available configuration options with their default values.
-Comments indicate the available options for enum fields.
+The example below lists every option with its default value and an inline comment describing
+it. It also includes a `default-parser` entry so the config works as-is; see
+xref:configuration/index.adoc[Configuration] for why that entry matters.
[source,json]
----
diff --git a/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc b/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc
index 389e110766..5cd5188196 100644
--- a/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc
+++ b/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc
@@ -29,8 +29,10 @@ icon:github[] https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers
== Full Configuration
-The following example shows all available configuration options with their default values.
-Comments indicate the available options for enum fields.
+The example below lists every option with its default value and an inline comment describing
+it. It also includes a `default-parser` entry so the config works as-is; see
+xref:configuration/index.adoc[Configuration] for why that entry matters. ImageMagick is
+optional — it is only used when `enableImagePreprocessing` or `applyRotation`
is true.
[source,json]
----
diff --git a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
index 2942e18d32..a3dc92901f 100644
--- a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
+++ b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
@@ -94,16 +94,18 @@ The converter currently supports:
"sortByPosition": true,
"maxMainMemoryBytes": 1000000
}
+ },
+ {
+ "default-parser": {}
}
]
}
----
-NOTE: When you configure a parser with specific settings in JSON, the loader automatically
-excludes it from SPI loading. The parser (e.g., `pdf-parser`) is not even instantiated in
-`default-parser` if there's a definition for it in the tika-config.json. Explicit `exclude`
-directives are only needed when you want to disable a parser entirely without providing
-custom configuration.
+NOTE: A `parsers` list loads *only* the parsers it names. The `default-parser` entry above
+restores all the other parsers (it is the JSON equivalent of the 3.x `DefaultParser`).
+Configuring a parser automatically excludes its default copy, so there is no duplication;
+explicit `exclude` directives are only needed to disable a parser without replacing it.
=== Key Differences
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json
index 591e214ee6..ceafb80e05 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json
@@ -5,6 +5,10 @@
"extractInlineImages": true,
"sortByPosition": true
}
+ },
+ {
+ // Keep Tika's other default parsers. Without this, this config is PDF-only.
+ "default-parser": {}
}
]
}
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json
index b5446871fc..19aca7fd22 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json
@@ -1,54 +1,104 @@
{
+ // A "parsers" list loads ONLY the parsers it names; the "default-parser" entry at
+ // the bottom keeps all the others. Windows paths in JSON need forward slashes or
+ // escaped backslashes.
"parsers": [
{
"pdf-parser": {
+ // Enforce the PDF's access permissions. DONT_CHECK ignores them.
// Options: DONT_CHECK, ALLOW_EXTRACTION_FOR_ACCESSIBILITY, IGNORE_ACCESSIBILITY_ALLOWANCE
"accessCheckMode": "DONT_CHECK",
+ // Character-width tolerance for inserting spaces (PDFBox).
"averageCharTolerance": 0.3,
+ // Collect per-stream IOExceptions in metadata and rethrow after parsing.
"catchIntermediateIOExceptions": true,
+ // Detect and correct rotated (angled) text runs within a page.
"detectAngles": false,
+ // Line-height multiple that starts a new paragraph (PDFBox).
"dropThreshold": 2.5,
+ // Estimate where spaces belong between words (most PDFs lack explicit spaces).
"enableAutoSpace": true,
+ // Extract AcroForm field content.
"extractAcroFormContent": true,
+ // Extract PDF actions; JavaScript macros become embedded documents.
"extractActions": false,
+ // Extract annotation text (comments, form-field captions).
"extractAnnotationText": true,
+ // Extract outline / bookmark text.
"extractBookmarksText": true,
+ // Record font names in metadata.
"extractFontNames": false,
+ // Record metadata about incremental updates (whether present, how many).
"extractIncrementalUpdateInfo": true,
+ // Record inline-image metadata only, without rendering (faster than extractInlineImages).
"extractInlineImageMetadataOnly": false,
+ // Render and extract inline images from content streams.
"extractInlineImages": false,
+ // Extract marked-content / structure tags, falling back to plain text.
"extractMarkedContent": false,
+ // Emit each unique inline image (by object id) only once.
"extractUniqueInlineImagesOnly": true,
+ // If the PDF has an XFA form, process only it.
"ifXFAExtractOnlyXFA": false,
+ // Ignore content-stream space glyphs; rely on the spacing algorithm (PDFBOX-3774).
"ignoreContentStreamSpaceGlyphs": false,
+ // EXPERT: replace the inline-image factory; give a class implementing
+ // ImageGraphicsEngineFactory, e.g.:
+ // "imageGraphicsEngineFactoryClass": "com.example.MyImageGraphicsEngineFactory"
+ // How to render page images; NONE renders nothing.
// Options: NONE, RAW_IMAGES, RENDER_PAGES_BEFORE_PARSE, RENDER_PAGES_AT_PAGE_END
"imageStrategy": "NONE",
+ // Max incremental updates to parse when parseIncrementalUpdates is true.
"maxIncrementalUpdates": 10,
+ // Max memory to load a PDF before buffering to a temp file (default 512MB).
"maxMainMemoryBytes": 536870912,
+ // Max pages to process; -1 = no limit.
"maxPages": -1,
+ // OCR settings. Requires an OCR engine (e.g. Tesseract) installed.
"ocr": {
+ // Render resolution (dpi) for OCR.
"dpi": 300,
+ // Image format sent to the OCR engine.
// Options: PNG, TIFF, JPEG
"imageFormat": "PNG",
+ // Image quality (0.0-1.0) for lossy formats.
"imageQuality": 1.0,
+ // Rendered-image color model.
// Options: RGB, GRAY
"imageType": "GRAY",
+ // Skip OCR for rendered pages larger than this area (w x h); -1 = no limit.
+ "maxImagePixels": 100000000,
+ // Max pages to OCR per document; -1 = no limit.
+ "maxPagesToOcr": -1,
+ // Which page content to render for OCR.
// Options: NO_TEXT, TEXT_ONLY, VECTOR_GRAPHICS_ONLY, ALL
"renderingStrategy": "ALL",
+ // When to run OCR; AUTO runs it only on text-poor pages.
// Options: AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION
"strategy": "AUTO",
+ // Per-page character thresholds that trigger AUTO OCR.
"strategyAuto": {
"totalCharsPerPage": 10,
"unmappedUnicodeCharsPerPage": 10
}
},
+ // Parse prior incremental-update versions as embedded documents.
"parseIncrementalUpdates": false,
+ // EXPERT: set the Sun KCMS color-management system property. Default false.
"setKCMS": false,
+ // Sort text by x/y position; helps some PDFs, can interleave columns in others.
"sortByPosition": false,
+ // Space-width tolerance for inserting spaces (PDFBox).
"spacingTolerance": 0.5,
+ // Remove text drawn twice over the same region (faked bold); can be slow.
"suppressDuplicateOverlappingText": false,
+ // Throw on an encrypted payload instead of skipping it.
"throwOnEncryptedPayload": false
}
+ },
+ {
+ // Keep Tika's other default parsers. Without this, this config is PDF-only.
+ "default-parser": {}
}
]
}
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json
index f41a367acc..f5df65ee5d 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json
@@ -5,6 +5,10 @@
"language": "eng",
"timeoutSeconds": 120
}
+ },
+ {
+ // Keep Tika's other default parsers. Without this, only image files are OCR'd.
+ "default-parser": {}
}
]
}
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json
index 96282dbe14..08b8857425 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json
@@ -1,36 +1,65 @@
{
+ // A "parsers" list loads ONLY the parsers it names; the "default-parser" entry at
+ // the bottom keeps all the others (needed so PDFs/Office docs still reach OCR).
+ // Windows paths in JSON need forward slashes or escaped backslashes.
"parsers": [
{
"tesseract-ocr-parser": {
+ // Calculate skew and rotate (via ImageMagick) before OCR. Needs ImageMagick.
"applyRotation": false,
+ // Colorspace of the preprocessed image (preprocessing only).
"colorspace": "gray",
+ // Resolution (dpi) of the preprocessed image.
"density": 300,
+ // Bits per color sample in the preprocessed image.
"depth": 4,
+ // Run ImageMagick preprocessing (density/depth/colorspace/filter/resize) before OCR.
"enableImagePreprocessing": false,
+ // ImageMagick resize filter.
"filter": "triangle",
+ // Directory holding the ImageMagick program (empty = on PATH). ImageMagick is
+ // OPTIONAL -- used only when enableImagePreprocessing or applyRotation is true.
"imageMagickPath": "",
+ // Write OCR output from embedded images inline into the parent document.
"inlineContent": false,
+ // Tesseract language(s); join multiple with '+', e.g. "eng+fra".
"language": "eng",
+ // Skip OCR for files larger than this many bytes.
"maxFileSizeToOcr": 2147483647,
+ // Skip OCR for files smaller than this many bytes.
"minFileSizeToOcr": 0,
- // Additional Tesseract configuration parameters as key-value pairs
+ // Additional raw Tesseract config variables (key-value).
"otherTesseractConfig": {
"preserve_interword_spaces": "1",
"textord_initialx_ile": "0.75",
"textord_noise_hfract": "0.15625"
},
+ // Tesseract output format.
// Options: TXT, HOCR
"outputType": "TXT",
+ // Inserted between OCR'd pages (empty overrides Tesseract 4's form-feed).
"pageSeparator": "",
+ // Page segmentation mode (0-13); 1 = auto with orientation/script detection.
"pageSegMode": "1",
+ // Load Tesseract language data at startup instead of on first use.
"preloadLangs": false,
+ // Preserve interword spacing in the output.
"preserveInterwordSpacing": false,
+ // Scale percent (100-900) applied during preprocessing.
"resize": 200,
+ // Runtime kill-switch to disable OCR.
"skipOcr": false,
+ // Directory with tessdata language files (empty = Tesseract default).
"tessdataPath": "",
+ // Directory with the tesseract binary (empty = on PATH).
"tesseractPath": "",
+ // Max seconds to wait for the OCR process.
"timeoutSeconds": 120
}
+ },
+ {
+ // Keep Tika's other default parsers. Without this, only image files are OCR'd.
+ "default-parser": {}
}
]
}
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java
index 8932af5ada..37d3fefb95 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java
@@ -35,7 +35,6 @@ import org.slf4j.LoggerFactory;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ocr.tess4j.ImageDeskew;
-import org.apache.tika.utils.SystemUtils;
class ImagePreprocessor implements Serializable {
@@ -58,10 +57,17 @@ class ImagePreprocessor implements Serializable {
if (config.isEnableImagePreprocessing() || (config.isApplyRotation() && angle != 0)) {
// process the image - parameter values can be set in TesseractOCRConfig.properties
+ //
+ // On Windows TesseractOCRParser.getImageMagickProg() returns "magick", i.e. the
+ // ImageMagick 7 program. IM7's native command form is
+ // magick [read settings] input [operators] output
+ // We intentionally do NOT prepend the legacy "convert" subcommand: "magick convert"
+ // runs IM7 in deprecated IM6-compatibility mode and emits a deprecation warning.
+ // Operators (-depth, -colorspace, -filter, -resize, -rotate) must follow the input
+ // image; only read-time settings such as -density may precede it. This ordering is
+ // also accepted by the legacy "convert" program used on non-Windows systems, so a
+ // single argument layout works on every platform.
CommandLine commandLine = new CommandLine(fullImageMagickPath);
- if (SystemUtils.IS_OS_WINDOWS) {
- commandLine.addArgument("convert");
- }
// Arguments for ImageMagick
final List<String> density =
@@ -79,19 +85,19 @@ class ImagePreprocessor implements Serializable {
Stream<List<String>> stream = Stream.empty();
if (angle == 0) {
if (config.isEnableImagePreprocessing()) {
- // Do pre-processing, but don't do any rotation
- stream = Stream.of(density, depth, colorspace, filter, resize, sourceFileArg,
+ // Pre-processing, no rotation. -density precedes the input; the image
+ // operators follow it.
+ stream = Stream.of(density, sourceFileArg, depth, colorspace, filter, resize,
targFileArg);
}
} else if (config.isEnableImagePreprocessing()) {
- // Do pre-processing with rotation
- stream =
- Stream.of(density, depth, colorspace, filter, resize, rotate, sourceFileArg,
- targFileArg);
+ // Pre-processing with rotation
+ stream = Stream.of(density, sourceFileArg, depth, colorspace, filter, resize, rotate,
+ targFileArg);
} else if (config.isApplyRotation()) {
// Just rotation
- stream = Stream.of(rotate, sourceFileArg, targFileArg);
+ stream = Stream.of(sourceFileArg, rotate, targFileArg);
}
final String[] args =
stream.flatMap(Collection::stream).toArray(String[]::new);
commandLine.addArguments(args, true);
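
Editor's sketch (not part of the commit): the reordering in the hunk above can be read as a single rule, "read-time settings, then the input, then the operators, then the output". The standalone Java class below restates that rule without Commons Exec or any Tika types. The class and method names are hypothetical. The values 300, 4, gray, triangle, and 200 are the defaults shown in tesseract-full.json; the "%" suffix on the resize value and the single `rotateDegrees` parameter (0 meaning "no rotation", with the caller expected to pass 0 unless rotation is enabled) are simplifying assumptions, not a copy of ImagePreprocessor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the argument layout; not Tika's ImagePreprocessor.
final class MagickArgsSketch {

    /**
     * Builds the arguments that would follow the ImageMagick program name.
     * Returns an empty list when there is nothing to do.
     */
    static List<String> buildArgs(boolean preprocess, double rotateDegrees,
                                  String source, String target) {
        List<String> args = new ArrayList<>();
        boolean rotate = rotateDegrees != 0;
        if (!preprocess && !rotate) {
            return args; // caller would skip ImageMagick entirely
        }
        if (preprocess) {
            // Read-time setting: the only option placed before the input image.
            args.add("-density");
            args.add("300");
        }
        args.add(source);
        if (preprocess) {
            // Image operators: placed after the input image.
            args.addAll(List.of("-depth", "4", "-colorspace", "gray",
                    "-filter", "triangle", "-resize", "200%"));
        }
        if (rotate) {
            // Rotation is also an operator, so it follows the input as well.
            args.add("-rotate");
            args.add(Double.toString(rotateDegrees));
        }
        args.add(target);
        return args;
    }

    public static void main(String[] argv) {
        System.out.println(String.join(" ", buildArgs(true, 0, "in.png", "out.png")));
        System.out.println(String.join(" ", buildArgs(false, 1.5, "in.png", "out.png")));
    }
}
```

Per the comment added in the commit, the same layout is accepted both by `magick` (ImageMagick 7) and by the legacy `convert` program, which is why one argument list can serve every platform.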