[
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008782#comment-16008782
]
Chris A. Mattmann commented on TIKA-2359:
-----------------------------------------
Hi [~lfcnassif], great points.
Your point here:
bq. I think it is more likely they will note the breaking change and search for
the option to get ocr back than a new user of Tika searching for an option to
get performance speed up or to disable some ocr that they do not know about.
I am not so sure about that. In fact, the data tells me the opposite. We haven't
had hundreds of JIRAs filed by users who find Tika to be slow, and OCR has been
on for quite a few releases now (only if Tesseract is installed, knowingly or
not, so it's not strictly "by default").
I'm happy to have a waiting period to consider this. I also think it's just as
easy either way, since a system property can be set to either enable or disable
OCR. Since OCR has been "enabled" whenever Tesseract is installed (a big "if"),
and that has been the expectation, I would say we ought to stay with that. We
can then help the handful of users who have raised performance as an issue in
tickets like this one by having that minority set the option as a command line
parameter. Either way, as you say, I would be a big +1 on having logging say
"OCR is on, did you really want that?" or something like that.
> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
> Key: TIKA-2359
> URL: https://issues.apache.org/jira/browse/TIKA-2359
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Eugen Mayer
> Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14, in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> Debian jessie, 8GB RAM in a Docker container, current Xeon 3GHz, so decent
> (limited to 2 cores)