[ https://issues.apache.org/jira/browse/TIKA-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264550#comment-15264550 ]
Tim Allison commented on TIKA-1963:
-----------------------------------
The other issue is that Tesseract doesn't run directly on PDFs; it will run on
embedded images, though, as you know (btw, make sure that you've configured
extraction of inline images).
So, for example, I'm not sure there is currently a way to have the Tesseract
parser act on a TIFF embedded in a PDF but _not_ on a standalone TIFF. You'd
have to do an initial detect on the main file to determine whether it is a PDF
and then process it accordingly.
Perhaps I'm misunderstanding, though...
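
To make the "initial detect, then process" idea concrete, here is a rough
sketch in plain Java. It is not Tika code: the check for the "%PDF-" signature
at offset 0 is a simplified stand-in for a real detector, and the class name
PdfRouting, the method names, and the "ocr-enabled"/"ocr-disabled" labels are
all hypothetical. In practice you would presumably build two parser setups (one
with the Tesseract parser, one without) and pick between them based on the
detected type of the container file.

```java
import java.nio.charset.StandardCharsets;

public class PdfRouting {
    // PDF files normally start with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Simplified stand-in for a real detector: true if the bytes begin with the PDF signature. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical routing decision: which parser setup to use for the container file. */
    static String chooseParserSetup(byte[] head) {
        return looksLikePdf(head) ? "ocr-enabled" : "ocr-disabled";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] tiff = {0x49, 0x49, 0x2A, 0x00}; // little-endian TIFF header
        System.out.println(chooseParserSetup(pdf));
        System.out.println(chooseParserSetup(tiff));
    }
}
```

The point of the design is that the decision is made once, on the outer file,
so a TIFF that arrives on its own is routed to the setup without OCR, while
images embedded in a PDF are handled by whatever setup was chosen for the PDF.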
> Configuring Parsers: "high degree of control over which parsers are or aren't
> used" does not work
> -------------------------------------------------------------------------------------------------
>
> Key: TIKA-1963
> URL: https://issues.apache.org/jira/browse/TIKA-1963
> Project: Tika
> Issue Type: Bug
> Components: config
> Affects Versions: 1.12
> Environment: windows, java version "1.8.0_73", 64 bit
> Reporter: Konstantin Avdeev
>
> Hi everybody!
> I'm trying to white-list a particular mime-type for OCR with the following
> config:
> {code}
> <properties>
> <parsers>
> <parser class="org.apache.tika.parser.DefaultParser">
> <mime-exclude>application/pdf</mime-exclude>
> <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
> </parser>
> <parser class="org.apache.tika.parser.pdf.PDFParser">
> <mime>application/pdf</mime>
> </parser>
> </parsers>
> </properties>
> {code}
> So the idea is to enable the Tesseract parser for the PDF format only.
> But this configuration disables Tesseract completely.
> Is this the expected behaviour or a bug?
> Thank you!