(tika) branch main updated: TIKA-4747 -- improve pdf and ocr/imagemagick docs. Make sure to include default-parser (#2862)

tallison Wed, 03 Jun 2026 14:16:14 -0700

This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git



The following commit(s) were added to refs/heads/main by this push:
     new ecbccdd8c8 TIKA-4747 -- improve pdf and ocr/imagemagick docs. Make sure to include default-parser (#2862)
ecbccdd8c8 is described below

commit ecbccdd8c8b11eb57b35b2992aef2abadb7448ec
Author: Tim Allison <[email protected]>
AuthorDate: Wed Jun 3 17:15:57 2026 -0400

    TIKA-4747 -- improve pdf and ocr/imagemagick docs. Make sure to include default-parser (#2862)
    
    Co-authored-by: Copilot Autofix powered by AI <[email protected]>
---
 docs/modules/ROOT/pages/configuration/index.adoc   | 26 +++++++++++
 .../pages/configuration/parsers/pdf-parser.adoc    |  5 ++-
 .../parsers/tesseract-ocr-parser.adoc              |  6 ++-
 .../pages/migration-to-4x/migrating-to-4x.adoc     | 12 +++---
 .../config-examples/pdf-parser-basic.json          |  4 ++
 .../resources/config-examples/pdf-parser-full.json | 50 ++++++++++++++++++++++
 .../resources/config-examples/tesseract-basic.json |  4 ++
 .../resources/config-examples/tesseract-full.json  | 31 +++++++++++++-
 .../apache/tika/parser/ocr/ImagePreprocessor.java  | 28 +++++++-----
 9 files changed, 145 insertions(+), 21 deletions(-)
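
Note on the Windows-path rule documented in this commit: the two accepted spellings can be produced mechanically. The sketch below is illustrative only; the class name, the method names and the sample path `C:\Tools\tesseract` are assumptions, not part of Tika. One point to keep in mind is that a backslash followed by certain letters (for example `\t`) is a valid JSON escape, so an unescaped path may be silently altered rather than rejected.

```java
// Hypothetical helper, not part of Tika: shows the two JSON-safe spellings
// of a Windows path described in the configuration docs.
final class JsonPathEscaping {

    /** Option 1: forward slashes, which need no escaping inside a JSON string. */
    static String toForwardSlashes(String windowsPath) {
        return windowsPath.replace('\\', '/');
    }

    /** Option 2: double every backslash so the JSON parser decodes it back to one. */
    static String escapeBackslashes(String windowsPath) {
        return windowsPath.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        // Java source literal for the (assumed) path C:\Tools\tesseract
        String path = "C:\\Tools\\tesseract";
        System.out.println("\"tesseractPath\": \"" + toForwardSlashes(path) + "\"");
        System.out.println("\"tesseractPath\": \"" + escapeBackslashes(path) + "\"");
    }
}
```

Either output line can be pasted into a parser's configuration object; the forward-slash form is usually easier to read.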

diff --git a/docs/modules/ROOT/pages/configuration/index.adoc b/docs/modules/ROOT/pages/configuration/index.adoc
index 068fc71c9f..c8cb3ab7e7 100644
--- a/docs/modules/ROOT/pages/configuration/index.adoc
+++ b/docs/modules/ROOT/pages/configuration/index.adoc
@@ -65,6 +65,32 @@ Per-section documentation:
   `plugin-roots` — see xref:pipes/configuration.adoc[Pipes Configuration]
   and xref:pipes/index.adoc[Tika Pipes].
 
+== The `parsers` list and `default-parser`
+
+Tika configuration files are JSON, with optional support for `//` and `/* */` comments.
+
+A `parsers` list loads *only* the parsers it names — every other parser is dropped.
+To customize one parser while keeping all the others, add a `default-parser` entry:
+
+[source,json]
+----
+{
+  "parsers": [
+    { "pdf-parser": { "sortByPosition": true } },
+    { "default-parser": {} }
+  ]
+}
+----
+
+Configuring a parser automatically excludes its default copy, so there is no duplication.
+Omit `default-parser` only when you want a Tika limited to the parsers you listed.
+
+== Windows file paths
+
+JSON uses the backslash as an escape character, so path options (e.g. `tesseractPath`,
+`imageMagickPath`) must use forward slashes (`C:/Tools/...`) or escaped backslashes
+(`C:\\Tools\\...`). A single backslash is a JSON parse error.
+
 == Topics
 
 === Parser Configuration
diff --git a/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc b/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc
index 1509282e1f..63ffa53c7c 100644
--- a/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc
+++ b/docs/modules/ROOT/pages/configuration/parsers/pdf-parser.adoc
@@ -29,8 +29,9 @@ icon:github[] https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers
 
 == Full Configuration
 
-The following example shows all available configuration options with their default values.
-Comments indicate the available options for enum fields.
+The example below lists every option with its default value and an inline comment describing
+it. It also includes a `default-parser` entry so the config works as-is; see
+xref:configuration/index.adoc[Configuration] for why that entry matters.
 
 [source,json]
 ----
diff --git a/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc b/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc
index 389e110766..5cd5188196 100644
--- a/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc
+++ b/docs/modules/ROOT/pages/configuration/parsers/tesseract-ocr-parser.adoc
@@ -29,8 +29,10 @@ icon:github[] https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers
 
 == Full Configuration
 
-The following example shows all available configuration options with their default values.
-Comments indicate the available options for enum fields.
+The example below lists every option with its default value and an inline comment describing
+it. It also includes a `default-parser` entry so the config works as-is; see
+xref:configuration/index.adoc[Configuration] for why that entry matters. ImageMagick is
+optional — it is only used when `enableImagePreprocessing` or `applyRotation` is true.
 
 [source,json]
 ----
diff --git a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
index 2942e18d32..a3dc92901f 100644
--- a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
+++ b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
@@ -94,16 +94,18 @@ The converter currently supports:
         "sortByPosition": true,
         "maxMainMemoryBytes": 1000000
       }
+    },
+    {
+      "default-parser": {}
     }
   ]
 }
 ----
 
-NOTE: When you configure a parser with specific settings in JSON, the loader automatically
-excludes it from SPI loading. The parser (e.g., `pdf-parser`) is not even instantiated in
-`default-parser` if there's a definition for it in the tika-config.json. Explicit `exclude`
-directives are only needed when you want to disable a parser entirely without providing
-custom configuration.
+NOTE: A `parsers` list loads *only* the parsers it names. The `default-parser` entry above
+restores all the other parsers (it is the JSON equivalent of the 3.x `DefaultParser`).
+Configuring a parser automatically excludes its default copy, so there is no duplication;
+explicit `exclude` directives are only needed to disable a parser without replacing it.
 
 === Key Differences
 
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json
index 591e214ee6..ceafb80e05 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-basic.json
@@ -5,6 +5,10 @@
         "extractInlineImages": true,
         "sortByPosition": true
       }
+    },
+    {
+      // Keep Tika's other default parsers. Without this, this config is PDF-only.
+      "default-parser": {}
     }
   ]
 }
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json
index b5446871fc..19aca7fd22 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/pdf-parser-full.json
@@ -1,54 +1,104 @@
 {
+  // A "parsers" list loads ONLY the parsers it names; the "default-parser" entry at
+  // the bottom keeps all the others. Windows paths in JSON need forward slashes or
+  // escaped backslashes.
   "parsers": [
     {
       "pdf-parser": {
+        // Enforce the PDF's access permissions. DONT_CHECK ignores them.
         // Options: DONT_CHECK, ALLOW_EXTRACTION_FOR_ACCESSIBILITY, IGNORE_ACCESSIBILITY_ALLOWANCE
         "accessCheckMode": "DONT_CHECK",
+        // Character-width tolerance for inserting spaces (PDFBox).
         "averageCharTolerance": 0.3,
+        // Collect per-stream IOExceptions in metadata and rethrow after parsing.
         "catchIntermediateIOExceptions": true,
+        // Detect and correct rotated (angled) text runs within a page.
         "detectAngles": false,
+        // Line-height multiple that starts a new paragraph (PDFBox).
         "dropThreshold": 2.5,
+        // Estimate where spaces belong between words (most PDFs lack explicit spaces).
         "enableAutoSpace": true,
+        // Extract AcroForm field content.
         "extractAcroFormContent": true,
+        // Extract PDF actions; JavaScript macros become embedded documents.
         "extractActions": false,
+        // Extract annotation text (comments, form-field captions).
         "extractAnnotationText": true,
+        // Extract outline / bookmark text.
         "extractBookmarksText": true,
+        // Record font names in metadata.
         "extractFontNames": false,
+        // Record metadata about incremental updates (whether present, how many).
         "extractIncrementalUpdateInfo": true,
+        // Record inline-image metadata only, without rendering (faster than extractInlineImages).
         "extractInlineImageMetadataOnly": false,
+        // Render and extract inline images from content streams.
         "extractInlineImages": false,
+        // Extract marked-content / structure tags, falling back to plain text.
         "extractMarkedContent": false,
+        // Emit each unique inline image (by object id) only once.
         "extractUniqueInlineImagesOnly": true,
+        // If the PDF has an XFA form, process only it.
         "ifXFAExtractOnlyXFA": false,
+        // Ignore content-stream space glyphs; rely on the spacing algorithm (PDFBOX-3774).
         "ignoreContentStreamSpaceGlyphs": false,
+        // EXPERT: replace the inline-image factory; give a class implementing
+        // ImageGraphicsEngineFactory, e.g.:
+        //   "imageGraphicsEngineFactoryClass": "com.example.MyImageGraphicsEngineFactory"
+        // How to render page images; NONE renders nothing.
         // Options: NONE, RAW_IMAGES, RENDER_PAGES_BEFORE_PARSE, RENDER_PAGES_AT_PAGE_END
         "imageStrategy": "NONE",
+        // Max incremental updates to parse when parseIncrementalUpdates is true.
         "maxIncrementalUpdates": 10,
+        // Max memory to load a PDF before buffering to a temp file (default 512MB).
         "maxMainMemoryBytes": 536870912,
+        // Max pages to process; -1 = no limit.
         "maxPages": -1,
+        // OCR settings. Requires an OCR engine (e.g. Tesseract) installed.
         "ocr": {
+          // Render resolution (dpi) for OCR.
           "dpi": 300,
+          // Image format sent to the OCR engine.
           // Options: PNG, TIFF, JPEG
           "imageFormat": "PNG",
+          // Image quality (0.0-1.0) for lossy formats.
           "imageQuality": 1.0,
+          // Rendered-image color model.
           // Options: RGB, GRAY
           "imageType": "GRAY",
+          // Skip OCR for rendered pages larger than this area (w x h); -1 = no limit.
+          "maxImagePixels": 100000000,
+          // Max pages to OCR per document; -1 = no limit.
+          "maxPagesToOcr": -1,
+          // Which page content to render for OCR.
           // Options: NO_TEXT, TEXT_ONLY, VECTOR_GRAPHICS_ONLY, ALL
           "renderingStrategy": "ALL",
+          // When to run OCR; AUTO runs it only on text-poor pages.
           // Options: AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION
           "strategy": "AUTO",
+          // Per-page character thresholds that trigger AUTO OCR.
           "strategyAuto": {
             "totalCharsPerPage": 10,
             "unmappedUnicodeCharsPerPage": 10
           }
         },
+        // Parse prior incremental-update versions as embedded documents.
         "parseIncrementalUpdates": false,
+        // EXPERT: set the Sun KCMS color-management system property. Default false.
         "setKCMS": false,
+        // Sort text by x/y position; helps some PDFs, can interleave columns in others.
         "sortByPosition": false,
+        // Space-width tolerance for inserting spaces (PDFBox).
         "spacingTolerance": 0.5,
+        // Remove text drawn twice over the same region (faked bold); can be slow.
         "suppressDuplicateOverlappingText": false,
+        // Throw on an encrypted payload instead of skipping it.
         "throwOnEncryptedPayload": false
       }
+    },
+    {
+      // Keep Tika's other default parsers. Without this, this config is PDF-only.
+      "default-parser": {}
     }
   ]
 }
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json
index f41a367acc..f5df65ee5d 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-basic.json
@@ -5,6 +5,10 @@
         "language": "eng",
         "timeoutSeconds": 120
       }
+    },
+    {
+      // Keep Tika's other default parsers. Without this, only image files are OCR'd.
+      "default-parser": {}
     }
   ]
 }
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json
index 96282dbe14..08b8857425 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/config-examples/tesseract-full.json
@@ -1,36 +1,65 @@
 {
+  // A "parsers" list loads ONLY the parsers it names; the "default-parser" entry at
+  // the bottom keeps all the others (needed so PDFs/Office docs still reach OCR).
+  // Windows paths in JSON need forward slashes or escaped backslashes.
   "parsers": [
     {
       "tesseract-ocr-parser": {
+        // Calculate skew and rotate (via ImageMagick) before OCR. Needs ImageMagick.
         "applyRotation": false,
+        // Colorspace of the preprocessed image (preprocessing only).
         "colorspace": "gray",
+        // Resolution (dpi) of the preprocessed image.
         "density": 300,
+        // Bits per color sample in the preprocessed image.
         "depth": 4,
+        // Run ImageMagick preprocessing (density/depth/colorspace/filter/resize) before OCR.
         "enableImagePreprocessing": false,
+        // ImageMagick resize filter.
         "filter": "triangle",
+        // Directory holding the ImageMagick program (empty = on PATH). ImageMagick is
+        // OPTIONAL -- used only when enableImagePreprocessing or applyRotation is true.
         "imageMagickPath": "",
+        // Write OCR output from embedded images inline into the parent document.
         "inlineContent": false,
+        // Tesseract language(s); join multiple with '+', e.g. "eng+fra".
         "language": "eng",
+        // Skip OCR for files larger than this many bytes.
         "maxFileSizeToOcr": 2147483647,
+        // Skip OCR for files smaller than this many bytes.
         "minFileSizeToOcr": 0,
-        // Additional Tesseract configuration parameters as key-value pairs
+        // Additional raw Tesseract config variables (key-value).
         "otherTesseractConfig": {
           "preserve_interword_spaces": "1",
           "textord_initialx_ile": "0.75",
           "textord_noise_hfract": "0.15625"
         },
+        // Tesseract output format.
         // Options: TXT, HOCR
         "outputType": "TXT",
+        // Inserted between OCR'd pages (empty overrides Tesseract 4's form-feed).
         "pageSeparator": "",
+        // Page segmentation mode (0-13); 1 = auto with orientation/script detection.
         "pageSegMode": "1",
+        // Load Tesseract language data at startup instead of on first use.
         "preloadLangs": false,
+        // Preserve interword spacing in the output.
         "preserveInterwordSpacing": false,
+        // Scale percent (100-900) applied during preprocessing.
         "resize": 200,
+        // Runtime kill-switch to disable OCR.
         "skipOcr": false,
+        // Directory with tessdata language files (empty = Tesseract default).
         "tessdataPath": "",
+        // Directory with the tesseract binary (empty = on PATH).
         "tesseractPath": "",
+        // Max seconds to wait for the OCR process.
         "timeoutSeconds": 120
       }
+    },
+    {
+      // Keep Tika's other default parsers. Without this, only image files are OCR'd.
+      "default-parser": {}
     }
   ]
 }
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java
index 8932af5ada..37d3fefb95 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/ImagePreprocessor.java
@@ -35,7 +35,6 @@ import org.slf4j.LoggerFactory;
 
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.ocr.tess4j.ImageDeskew;
-import org.apache.tika.utils.SystemUtils;
 
 class ImagePreprocessor implements Serializable {
 
@@ -58,10 +57,17 @@ class ImagePreprocessor implements Serializable {
 
         if (config.isEnableImagePreprocessing() || (config.isApplyRotation() && angle != 0)) {
             // process the image - parameter values can be set in TesseractOCRConfig.properties
+            //
+            // On Windows TesseractOCRParser.getImageMagickProg() returns "magick", i.e. the
+            // ImageMagick 7 program. IM7's native command form is
+            //     magick [read settings] input [operators] output
+            // We intentionally do NOT prepend the legacy "convert" subcommand: "magick convert"
+            // runs IM7 in deprecated IM6-compatibility mode and emits a deprecation warning.
+            // Operators (-depth, -colorspace, -filter, -resize, -rotate) must follow the input
+            // image; only read-time settings such as -density may precede it. This ordering is
+            // also accepted by the legacy "convert" program used on non-Windows systems, so a
+            // single argument layout works on every platform.
             CommandLine commandLine = new CommandLine(fullImageMagickPath);
-            if (SystemUtils.IS_OS_WINDOWS) {
-                commandLine.addArgument("convert");
-            }
 
             // Arguments for ImageMagick
             final List<String> density =
@@ -79,19 +85,19 @@ class ImagePreprocessor implements Serializable {
             Stream<List<String>> stream = Stream.empty();
             if (angle == 0) {
                 if (config.isEnableImagePreprocessing()) {
-                    // Do pre-processing, but don't do any rotation
-                    stream = Stream.of(density, depth, colorspace, filter, resize, sourceFileArg,
+                    // Pre-processing, no rotation. -density precedes the input; the image
+                    // operators follow it.
+                    stream = Stream.of(density, sourceFileArg, depth, colorspace, filter, resize,
                             targFileArg);
                 }
             } else if (config.isEnableImagePreprocessing()) {
-                // Do pre-processing with rotation
-                stream =
-                        Stream.of(density, depth, colorspace, filter, resize, rotate, sourceFileArg,
-                                targFileArg);
+                // Pre-processing with rotation
+                stream = Stream.of(density, sourceFileArg, depth, colorspace, filter, resize, rotate,
+                        targFileArg);
 
             } else if (config.isApplyRotation()) {
                 // Just rotation
-                stream = Stream.of(rotate, sourceFileArg, targFileArg);
+                stream = Stream.of(sourceFileArg, rotate, targFileArg);
             }
             final String[] args = stream.flatMap(Collection::stream).toArray(String[]::new);
             commandLine.addArguments(args, true);
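
For readers who want to see the resulting argument order outside the diff, the following is a minimal standalone sketch of the layout the comments above describe: `-density` before the input file, every image operator after it, and the output file last. It is not the Tika implementation. The class and method names are hypothetical, the option values are hard-coded from the defaults shown in tesseract-full.json, and the exact formatting of the `-resize` and `-rotate` values is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and value formatting are assumptions.
final class ImageMagickArgs {

    /** Builds "[read settings] input [operators] output"; returns an empty list when nothing applies. */
    static List<String> buildArgs(boolean preprocess, boolean applyRotation, double angle,
                                  String source, String target) {
        List<String> args = new ArrayList<>();
        boolean rotate = angle != 0 && (preprocess || applyRotation);
        if (!preprocess && !rotate) {
            return args; // neither preprocessing nor rotation requested
        }
        if (preprocess) {
            // Read-time setting: the only option placed before the input image.
            args.addAll(Arrays.asList("-density", "300"));
        }
        args.add(source);
        if (preprocess) {
            // Image operators must follow the input image.
            args.addAll(Arrays.asList("-depth", "4", "-colorspace", "gray",
                    "-filter", "triangle", "-resize", "200%"));
        }
        if (rotate) {
            args.addAll(Arrays.asList("-rotate", String.valueOf(angle)));
        }
        args.add(target);
        return args;
    }

    public static void main(String[] unused) {
        System.out.println(buildArgs(true, false, 0.0, "in.png", "out.png"));
        System.out.println(buildArgs(false, true, 1.5, "in.png", "out.png"));
    }
}
```

Because the same ordering is accepted by both `magick` and the legacy `convert` program, a caller would only need to vary the executable name per platform, not the argument list.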
