[
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007718#comment-16007718
]
Eugen Mayer edited comment on TIKA-2359 at 5/12/17 7:03 AM:
------------------------------------------------------------
Guys, as far as I understood, you just explained that you
1. Are not using the easy, deterministic, fast extraction libs by default, even
if they are installed (as a design decision), like exiftool, strings and others
2. But you are using the most expensive, non-deterministic one, OCR, by default.
No matter what this means for legacy users, think about the decision process
here. I would say this needs to be fixed. Not for me, I got this, but believe
me, I have been using Tika for 4 years now and that's the first time I stumbled
upon this. I have been living with this waste of time for 4 years now.
This just fools your user base and even makes you look bad performance-wise. I
was comparing Tika to other doc/pdf-to-text libs which performed better and was
about to switch, because I had no idea I was comparing apples with oranges (OCR
vs plaintext).
To give you a number, the example document takes 93s with the defaults (so with
OCR) and 0.9s without. We are talking about roughly 100x slower.
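For reference, the "roughly 100x" figure follows directly from the two timings
quoted above. A minimal sketch of the arithmetic; the helper name `slowdown` is
only illustrative and not part of Tika:

```python
def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times slower the slow run is than the fast run."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return slow_seconds / fast_seconds


# Timings reported in this comment for the sample document.
with_ocr = 93.0     # seconds, Tika defaults (OCR enabled)
without_ocr = 0.9   # seconds, OCR disabled

print(f"{slowdown(with_ocr, without_ocr):.1f}x")  # prints: 103.3x
```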
> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
> Key: TIKA-2359
> URL: https://issues.apache.org/jira/browse/TIKA-2359
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Eugen Mayer
> Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14, in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> Debian jessie, 8GB RAM in a Docker container, current Xeon 3GHz, so decent
> (limited to 2 cores)