[
https://issues.apache.org/jira/browse/TIKA-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283115#comment-17283115
]
Tim Allison commented on TIKA-3298:
-----------------------------------
In a previous life, I made the mistake of assuming the stream gobbler had
finished once the process had finished. When that happens, the output from the
process can be truncated, and it will likely be truncated at different places
depending on the whims of timing. I added a .join() on the gobbler thread in
the {{--list-langs}} processing so that it _should_ wait the right amount of
time to gather all of the langs, but let me know if you are seeing
truncated values.
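
For anyone following along, here is a minimal sketch of the pattern described above: wait for the process, then also join the gobbler thread before reading what it collected. This is not the actual Tika code; the names {{GobblerSketch}}, {{StreamGobbler}} and {{collectOutput}} are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical.
class GobblerSketch {

    /** Drains a stream line by line on its own thread. */
    static class StreamGobbler extends Thread {
        private final InputStream in;
        private final List<String> lines = new ArrayList<>();

        StreamGobbler(InputStream in) {
            this.in = in;
            setDaemon(true);
        }

        @Override
        public void run() {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            } catch (IOException e) {
                // Sketch: stop reading on error and keep what was gathered so far.
            }
        }

        /** Only call this after join(); join() makes the collected lines visible. */
        List<String> getLines() {
            return new ArrayList<>(lines);
        }
    }

    /** Starts the process and returns everything it printed (stdout and stderr merged). */
    static List<String> collectOutput(ProcessBuilder pb)
            throws IOException, InterruptedException {
        pb.redirectErrorStream(true);
        Process process = pb.start();
        StreamGobbler gobbler = new StreamGobbler(process.getInputStream());
        gobbler.start();
        process.waitFor();  // the process has exited, but the gobbler may still be reading
        gobbler.join();     // without this, getLines() can return truncated output
        return gobbler.getLines();
    }
}
```

A call would look like {{GobblerSketch.collectOutput(new ProcessBuilder("tesseract", "--list-langs"))}}. In real code it is probably safer to use {{join(long millis)}} with a timeout so that a stuck stream cannot hang initialization.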
> Add a "preloadLangs" parameter to TesseractOCRParser
> ----------------------------------------------------
>
> Key: TIKA-3298
> URL: https://issues.apache.org/jira/browse/TIKA-3298
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0.0
>
> Attachments: image-2021-02-10-18-59-47-793.png,
> image-2021-02-10-19-00-10-691.png, image-2021-02-11-08-56-38-712.png
>
>
> [~peterkronenberg] on the user/dev lists and on TIKA-3297 and TIKA-3296 has
> observed that the tesseract error message for "lang data doesn't exist" is
> not very clear. We could add a "preloadLangs" option to
> TesseractOCRParser (default would be {{false}}). If set to {{true}}, the
> parser will, upon initialization and if it finds tesseract, call {{tesseract
> --list-langs}} and store those langs. At parse time, if the langs set
> has anything in it, the TesseractOCRParser will check the user-requested
> language against that set and throw a clearer exception telling the user that
> the language data doesn't exist for the requested language.
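
The parse-time check proposed in the description could look roughly like the sketch below. It is an assumption-laden illustration, not the TesseractOCRParser implementation: {{PreloadLangsSketch}} and {{checkLangs}} are hypothetical names, and {{IllegalArgumentException}} stands in for whatever exception type the parser would actually throw. It splits the requested value on {{+}} because tesseract accepts combined languages such as {{eng+fra}}.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; class, method and exception choice are hypothetical.
class PreloadLangsSketch {

    /**
     * Verifies that every requested language is in the preloaded set.
     * An empty set means preloading was off or found nothing, so the check is skipped.
     */
    static void checkLangs(String requested, Set<String> preloaded) {
        if (preloaded.isEmpty()) {
            return;
        }
        Set<String> missing = new TreeSet<>();
        // tesseract accepts combined languages joined with '+', e.g. "eng+fra"
        for (String lang : requested.split("\\+")) {
            String trimmed = lang.trim();
            if (!trimmed.isEmpty() && !preloaded.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "Tesseract language data not found for " + missing
                    + "; available languages: " + new TreeSet<>(preloaded));
        }
    }
}
```

With a preloaded set of {{eng}} and {{fra}}, a request for {{eng+deu}} would fail up front with a message naming {{deu}}, rather than surfacing tesseract's own error at OCR time.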