[
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008774#comment-16008774
]
Luis Filipe Nassif commented on TIKA-2359:
------------------------------------------
Hi Cris, thank you!
I think this issue demonstrates that many users have OCR installed on their
systems for reasons unrelated to Tika, and they will get a 100x slowdown
without knowing why. So the original hypothesis raised in TIKA-93, that
Tesseract is uncommon and, if present, is there for Tika, is wrong. New users
(and some old ones!) may not know they have to set a Java system property to
get a 100x speedup.
For users who need OCR, setting a Java runtime property should also be simple.
Of course, this is a breaking change that must be documented everywhere: on the
wiki, in the release notes, in the site announcement, and even in the logs. As
for users who miss all those warnings, I think they are more likely to notice
the breaking change and search for the option to get OCR back than a new Tika
user is to search for an option to speed things up or to disable OCR they do
not know about.
So I propose that 1.15 add a log message saying "OCR is on and can cause severe
slowdowns, and it will be disabled by default in 1.16". That gives users more
time to learn about the change.
> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
> Key: TIKA-2359
> URL: https://issues.apache.org/jira/browse/TIKA-2359
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Eugen Mayer
> Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s with 1.14, in both server and CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> Debian jessie, 8GB RAM in a Docker container, current Xeon 3GHz, so decent
> (limited to 2 cores)
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)