This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch TIKA-4750-docs
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 631057cea3f17ea367cc7ab941ed90b39e2e746b
Author: tallison <[email protected]>
AuthorDate: Sat Jun 6 07:58:55 2026 -0400

    TIKA-4750 - improve docs
---
 docs/modules/ROOT/nav.adoc                          |  2 +-
 docs/modules/ROOT/pages/configuration/index.adoc    |  2 +-
 .../pages/configuration/parsers/tess4j-parser.adoc  | 21 +++++++++++++++++++++
 .../apache/tika/parser/ocr/tess4j/Tess4JParser.java | 12 ++++++++++++
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 070f535ff8..3c7ae7a011 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -54,7 +54,7 @@
 ** xref:configuration/parsers/tesseract-ocr-parser.adoc[Tesseract OCR]
 ** xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers (Claude, Gemini, OpenAI)]
 ** xref:configuration/parsers/external-parser.adoc[External Parser (ffmpeg, exiftool, etc.)]
-** xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process)]
+** xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process, advanced)]
 * xref:migration-to-4x/index.adoc[Migration to 4.x]
 ** xref:migration-to-4x/migrating-to-4x.adoc[Migration Guide]
 ** xref:migration-to-4x/migrating-tika-server-4x.adoc[Tika Server Migration]
diff --git a/docs/modules/ROOT/pages/configuration/index.adoc b/docs/modules/ROOT/pages/configuration/index.adoc
index c8cb3ab7e7..09039e72db 100644
--- a/docs/modules/ROOT/pages/configuration/index.adoc
+++ b/docs/modules/ROOT/pages/configuration/index.adoc
@@ -97,7 +97,7 @@ JSON uses the backslash as an escape character, so path options (e.g. `tesseract
 
 * xref:configuration/parsers/pdf-parser.adoc[PDFParser] — PDF parsing options
 * xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser] — OCR options for image-based text extraction
-* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR Parser] — in-process OCR via tess4j JNI bindings
+* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR Parser] — in-process OCR via tess4j JNI bindings (advanced users only; most users should prefer the TesseractOCRParser above)
 * xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers] — Claude, Gemini, OpenAI, Ollama, vLLM
 * xref:configuration/parsers/external-parser.adoc[External Parser] — wrap external tools (ffmpeg, exiftool, etc.)
 
diff --git a/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc b/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
index 4dccac2c4d..83b0050c05 100644
--- a/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
+++ b/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
@@ -17,6 +17,27 @@
 
 = Tess4J OCR Parser
 
+[IMPORTANT]
+====
+*Advanced users only.* `Tess4JParser` loads the Tesseract native library
+directly into your JVM through https://github.com/java-native-access/jna[JNA]
+(Java Native Access). Operating it safely means locating and linking the
+correct platform-specific native libraries, reasoning about the Java/native
+boundary, and accepting that a fault in the native code can crash the entire
+JVM. In short, it assumes you are comfortable working with native-library
+integration (JNA/JNI).
+
+If that doesn't describe you, please don't reach for this parser — and that's
+perfectly fine. The standard
+xref:configuration/parsers/tesseract-ocr-parser.adoc[`TesseractOCRParser`]
+performs the same OCR by running the `tesseract` command-line program in a
+separate process. It needs no native linking, is far easier to set up, and a
+crash in Tesseract can never take down your application, so it is the
+recommended choice for almost everyone. Choose `Tess4JParser` only when you
+have a measured need for in-process OCR throughput *and* the expertise to run
+native bindings safely.
+====
+
 The `Tess4JParser` is an OCR parser that calls the Tesseract native library
 in-process via https://github.com/nguyenq/tess4j[Tess4J] and JNA, rather than
 spawning a `tesseract` child process for every image. This eliminates
diff --git a/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java b/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
index d02952e6bb..136e41fcbd 100644
--- a/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
+++ b/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
@@ -58,6 +58,18 @@ import org.apache.tika.utils.StringUtils;
 /**
  * OCR parser using <a href="https://github.com/nguyenq/tess4j">Tess4J</a>,
  * which provides a Java JNA wrapper around the native Tesseract library.
+ *
+ * <p><b>Advanced users only.</b> This parser loads the Tesseract native library
+ * directly into the JVM via JNA (Java Native Access). Using it safely requires
+ * locating and linking the correct platform-specific native libraries and
+ * accepting that a fault in the native code can crash the entire JVM. If you are
+ * not comfortable with native-library integration (JNA/JNI), please prefer the
+ * standard {@code TesseractOCRParser}, which performs the same OCR by running the
+ * {@code tesseract} command-line program in a separate process: it needs no
+ * native linking and a crash in Tesseract can never take down your application,
+ * so it is the recommended choice for almost everyone. Reach for
+ * {@code Tess4JParser} only when you have a measured need for in-process OCR
+ * throughput <em>and</em> the expertise to operate native bindings safely.
  * <p>
  * Unlike the command-line {@code TesseractOCRParser}, this parser calls Tesseract
  * in-process via JNA, eliminating the per-file process-spawn overhead.
