[
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008628#comment-16008628
]
Chris A. Mattmann commented on TIKA-2359:
-----------------------------------------
This is a tough one. In general I'd be fine with adding a boolean parameter to the
tesseract config, org.apache.tika.parser.ocr.tesseract.enable (default
"false"). That said, doing so would break things for those who, since TIKA-93,
expect that if they install Tesseract, Tika picks it up and uses it. So it would
be an extremely non-back-compat change, b/c now we would require users to install
some config file, update their Java sysprops, or set Tika config parameters, which
isn't nice at all. Part of the convenience of Tika "picking up" Tesseract is that
it is zero config, zero maintenance.
Any change to this needs careful thought, documentation updates on the wiki and
in CHANGES.txt, and convenience scripts, etc., that make it extremely painless
both for the one-time upgrade and for using OCR with Tika going forward. I am in
the boat of users who rely on this by default if Tesseract is
available/installed.
Consider the opposite: would it be so hard to simply add a property to turn it
on/off, and have it on by default (and then allow it to be disabled with, e.g.,
java -Dorg.apache.tika.parser.ocr.tesseract=false)? To me that's easier, handles
the back compat better, and is less intrusive.
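A minimal sketch of what such an opt-out check could look like. The property
name is the one suggested in this comment; it is a proposal, not an existing
Tika setting, and the OcrToggle class and ocrEnabled method are hypothetical
names used only for illustration. OCR stays on unless the property is
explicitly set to "false", so existing zero-config installs keep working.

```java
import java.util.Properties;

// Hypothetical helper, not part of Tika: illustrates an "on by default,
// opt-out via system property" check.
class OcrToggle {
    // Property name as proposed in the comment; assumed, not an existing Tika key.
    static final String PROP = "org.apache.tika.parser.ocr.tesseract";

    // Returns true unless the property is explicitly set to "false"
    // (case-insensitive, surrounding whitespace ignored).
    static boolean ocrEnabled(Properties props) {
        String value = props.getProperty(PROP);
        return value == null || !value.trim().equalsIgnoreCase("false");
    }

    public static void main(String[] args) {
        // Reads the JVM system properties, i.e. anything passed with -D.
        System.out.println(ocrEnabled(System.getProperties()));
    }
}
```

Run with -Dorg.apache.tika.parser.ocr.tesseract=false, this prints false; with
the property unset, OCR stays enabled, which is the back-compat behaviour
argued for above.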
My 2c.
> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
> Key: TIKA-2359
> URL: https://issues.apache.org/jira/browse/TIKA-2359
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Eugen Mayer
> Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14, in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB RAM in a docker container, current Xeon 3GHz, so decent
> (limited to 2 cores)