[ 
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008774#comment-16008774
 ] 

Luis Filipe Nassif edited comment on TIKA-2359 at 5/12/17 10:08 PM:
--------------------------------------------------------------------

Hi Chris, thank you!

I think this issue shows that many users have OCR (Tesseract) installed on 
their systems for reasons unrelated to Tika, and they will get a 100x 
slowdown without knowing why. So the original assumption made in TIKA-93, that 
Tesseract is uncommon and is only installed for Tika's sake, is wrong. New 
users (and some old ones!) may not know they have to set a Java system 
property to get a 100x speedup.

For users who need OCR, it should be just as simple to set a Java system 
property to enable it. Of course, this is a breaking change that must be 
documented everywhere: on the wiki, in the release notes, in a site 
announcement, and even in the logs. As for users who miss all those warnings, 
I think they are more likely to notice the breaking change and search for the 
option to get OCR back than a new Tika user is to search for an option to 
speed things up or to disable OCR they do not know about.

So I propose that 1.15 add a log message saying "OCR is on and can cause 
severe slowdowns, and it will be disabled by default in 1.16". That gives 
users more time to learn about the change.
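
As a rough sketch of the proposal above (not Tika's actual code: the class 
name and the property name `example.ocr.enabled` are hypothetical, chosen only 
for illustration), the transition could look something like this. In 1.15 the 
default would stay on and the warning would be logged; in 1.16 only the 
default passed in would change.

```java
import java.util.logging.Logger;

final class OcrDefaultSketch {
    // Hypothetical property name for illustration; not an actual Tika setting.
    static final String OCR_PROPERTY = "example.ocr.enabled";

    // Wording taken from the proposed log message.
    static final String WARNING =
        "OCR is on and can cause severe slowdowns, and it will be disabled by default in 1.16";

    private static final Logger LOG = Logger.getLogger(OcrDefaultSketch.class.getName());

    // An explicit property value wins; otherwise the release default applies.
    static boolean ocrEnabled(String propertyValue, boolean defaultEnabled) {
        if (propertyValue == null) {
            return defaultEnabled;
        }
        return Boolean.parseBoolean(propertyValue);
    }

    // Returns the warning to log when OCR is on, or null when it is off.
    static String warningFor(boolean enabled) {
        return enabled ? WARNING : null;
    }

    public static void main(String[] args) {
        // 1.15-style behaviour: OCR stays on by default, but the user is warned.
        boolean enabled = ocrEnabled(System.getProperty(OCR_PROPERTY), true);
        String warning = warningFor(enabled);
        if (warning != null) {
            LOG.warning(warning);
        }
    }
}
```

With this shape, a user who needs OCR after the default flips would only have 
to pass something like `-Dexample.ocr.enabled=true` on the command line.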


> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93 s with 1.14, in either server or CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> Debian jessie in a Docker container with 8 GB RAM on a current 3 GHz Xeon 
> (limited to 2 cores), so decent hardware



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
