[
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007718#comment-16007718
]
Eugen Mayer commented on TIKA-2359:
-----------------------------------
Guys, as far as I understood, you just explained that you
1. Do not use the easy, deterministic, fast extraction libs by default when they
are installed (as a design decision)
2. But you do use the most expensive, non-deterministic one, OCR, by default.
No matter what this means for legacy issues, think about the decision process
here. I would say this needs to be fixed. Not for me, I have this sorted now,
but believe me: I have been using Tika for 4 years and this is the first time I
have stumbled upon this. I have been living with this waste of time for 4 years.
This just fools your user base and even makes you look bad performance-wise. I
was comparing Tika to other doc/pdf-to-text libs, which performed better, and
was about to switch, because I had no idea I was comparing apples with oranges
(OCR vs plaintext).
To give you a number: the example document takes 93s with the defaults (so with
OCR) and 0.9s without. We are talking about roughly 100x slower.
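For anyone who wants to check the ratio, here is a minimal sketch of the arithmetic. It uses only the two timings quoted above; the function name `slowdown_factor` is purely illustrative and not part of Tika.

```python
def slowdown_factor(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times slower the slow run is than the fast run."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return slow_seconds / fast_seconds


# Timings reported in this comment: 93s with the defaults (OCR), 0.9s without.
factor = slowdown_factor(93.0, 0.9)
print(f"{factor:.1f}x slower")  # prints "103.3x slower"
```

So "roughly 100x" is, if anything, slightly understated for this one document.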
> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
> Key: TIKA-2359
> URL: https://issues.apache.org/jira/browse/TIKA-2359
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Eugen Mayer
> Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14, in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB RAM in a Docker container, current Xeon 3GHz, so decent
> (limited to 2 cores)
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)