[jira] [Comment Edited] (TIKA-2359) Extreme slow parsing on the attachment attached

Eugen Mayer (JIRA) Fri, 12 May 2017 00:04:28 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007718#comment-16007718
 ]


Eugen Mayer edited comment on TIKA-2359 at 5/12/17 7:03 AM:
------------------------------------------------------------

Guys, as far as I understood, you just explained that you

1. Do not use the easy, deterministic, fast extraction libs by default, even if 
they are installed (as a design decision) - like exiftool, strings and others
2. But you do use the most expensive, non-deterministic one, OCR, by default.

No matter what this means for legacy users, think about the decision process 
here - I would say this needs to be fixed. Not for me - I have it sorted now, 
but believe me, I have been using Tika for 4 years and this is the first time 
I stumbled upon this - I have been living with this waste of time for 4 years.

This just fools your user base and even makes you look bad performance-wise - I 
was comparing Tika to other doc/pdf-to-text libs which performed better and was 
about to switch, because I had no idea I was comparing apples with oranges (OCR 
vs plaintext).

To give you a number, the example document takes 93s with the defaults (so with 
OCR) and 0.9s without. We are talking about roughly 100x slower.
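
As a quick sanity check on the "roughly 100x" figure, here is a minimal sketch 
that uses only the two timings quoted above. The function name `slowdown` is 
made up for illustration and is not part of Tika.

```python
def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times slower the slow run is than the fast run."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return slow_seconds / fast_seconds


# Timings reported above for the example document.
with_ocr = 93.0     # seconds, Tika defaults (OCR enabled)
without_ocr = 0.9   # seconds, OCR disabled

ratio = slowdown(with_ocr, without_ocr)
print(f"{ratio:.0f}x slower with OCR")  # prints "103x slower with OCR"
```

So the measured gap is a bit over 100x, about two orders of magnitude.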




was (Author: eugenmayer):
Guys as far as i understood you just explained that  you

1. Are not using the easy deterministic, fast extraction libs if the are 
installed by default ( as a design decision )
2. But you are using the most expensive, not deterministic on, OCR, by default.

No matter what this means for legacy issue, think about the decision process 
here - i would say this needs to be fixed. Not for me - i got this, but believe 
me, i am using TIKA for 4 years now and thats the first time i stumbeled uppon 
this - i am living with this waste of time for 4 years now.

This just fools your user base and makes you even look bad performance wise - i 
was comparing tike to other doc/pdf to text libs which performed better and was 
about to switch - because i had not idea i compare apple with oranges ( OCR vs 
plaintext ).

To give you a number, the example document take 93s with the defaults (so with 
OCR) and  0.9s without. We are talking about roughly 100x slower.



> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14 in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> Debian jessie, 8GB RAM in a Docker container, current Xeon 3GHz, so decent 
> (limited to 2 cores)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
