[
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006689#comment-16006689
]
Eugen Mayer commented on TIKA-2359:
-----------------------------------
Oh holy... seriously? OCR runs by default simply because a library is
installed, one that gets pulled in by LibreOffice? This is incredibly odd.
For the googlers, this config excludes the OCR parser:
cat /etc/tika.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
export TIKA_CONFIG=/etc/tika.xml
And then just run
java -jar tika.jar test.pdf
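
The steps above can be collected into one script. This is only a sketch under some assumptions: `CONFIG_PATH` and the file name `tika-no-ocr.xml` are hypothetical (the comment used /etc/tika.xml), and the `tesseract` check is just a hint about whether the OCR parser is likely to be triggered on this machine. The Tika run is skipped unless `tika.jar` and `test.pdf` are present in the current directory.

```shell
#!/bin/sh
# Hypothetical config location; override by setting CONFIG_PATH beforehand.
CONFIG_PATH="${CONFIG_PATH:-$PWD/tika-no-ocr.xml}"

# Report whether a tesseract binary is visible on PATH (informational only).
if command -v tesseract >/dev/null 2>&1; then
  echo "tesseract found on PATH"
else
  echo "tesseract not found on PATH"
fi

# Write the config from the comment: DefaultParser minus TesseractOCRParser.
cat > "$CONFIG_PATH" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
EOF

# Point Tika at the config via the environment variable used in the comment.
export TIKA_CONFIG="$CONFIG_PATH"
echo "TIKA_CONFIG=$TIKA_CONFIG"

# Run Tika only when the jar, the sample file, and java are all available.
if [ -f tika.jar ] && [ -f test.pdf ] && command -v java >/dev/null 2>&1; then
  java -jar tika.jar test.pdf > /dev/null
fi
```

Writing the file to a user-writable path instead of /etc avoids needing root, which may matter inside a container image that runs as a non-root user.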
> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
> Key: TIKA-2359
> URL: https://issues.apache.org/jira/browse/TIKA-2359
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Eugen Mayer
> Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93 s with Tika 1.14, in both server and CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> Debian jessie, 8 GB RAM in a Docker container, current 3 GHz Xeon, so decent
> hardware (limited to 2 cores)