Konstantin Avdeev created TIKA-1963:
---------------------------------------
Summary: Configuring Parsers: "high degree of control over which
parsers are or aren't used" does not work
Key: TIKA-1963
URL: https://issues.apache.org/jira/browse/TIKA-1963
Project: Tika
Issue Type: Bug
Components: config
Affects Versions: 1.12
Environment: windows, java version "1.8.0_73", 64 bit
Reporter: Konstantin Avdeev
Hi everybody!
I'm trying to whitelist a single MIME type for OCR with the following
config:
{code}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
{code}
The idea is to enable the Tesseract parser for PDF documents only.
Instead, this configuration disables Tesseract completely.
Is this the expected behaviour, or a bug?
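To make explicit what the configuration above declares, here is a minimal sketch that lists each <parser> entry together with its rules. It uses only the JDK's XML parser; the class name TikaConfigSummary and the method summarize are hypothetical names chosen for illustration and are not part of Tika. It only reads the XML and does not reproduce how Tika itself resolves parsers.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Tika class): summarizes the <parser> entries of a config.
public class TikaConfigSummary {

    // The configuration from this report, embedded so the sketch is self-contained.
    public static final String CONFIG =
        "<properties><parsers>"
        + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
        + "<mime-exclude>application/pdf</mime-exclude>"
        + "<parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>"
        + "</parser>"
        + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
        + "<mime>application/pdf</mime>"
        + "</parser>"
        + "</parsers></properties>";

    // Returns one line per <parser> element: its class followed by its child rules.
    public static List<String> summarize(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config XML", e);
        }
        List<String> lines = new ArrayList<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            StringBuilder line = new StringBuilder(parser.getAttribute("class"));
            NodeList children = parser.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes
                }
                Element rule = (Element) child;
                // Rules carry their value either in a class attribute or as text.
                String value = rule.hasAttribute("class")
                    ? rule.getAttribute("class")
                    : rule.getTextContent().trim();
                line.append(" | ").append(rule.getTagName()).append("=").append(value);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize(CONFIG)) {
            System.out.println(line);
        }
    }
}
```

The summary shows that the only place the configuration names TesseractOCRParser is the parser-exclude rule on DefaultParser; the second entry registers PDFParser alone for application/pdf. Whether PDFParser is expected to reach Tesseract by some other route is exactly the open question of this report.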
Thank you!