[jira] [Comment Edited] (TIKA-1963) Configuring Parsers: "high degree of control over which parsers are or aren't used" does not work

Konstantin Avdeev (JIRA) Sat, 30 Apr 2016 11:55:45 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265402#comment-15265402 ]


Konstantin Avdeev edited comment on TIKA-1963 at 4/30/16 6:55 PM:
------------------------------------------------------------------

The format of the configuration file is not well documented (yes, I agree, 
writing documentation is boring :)), so I'm working from assumptions.
Assumption 1: <mime-exclude> and <parser-exclude> elements under the same 
parent element are independent of each other.
Assumption 2: <parser> elements are independent of each other too.

So, if the assumptions above are true, I do not understand why the second 
<parser> definition (for PDFParser) uses the setting (exclude 
TesseractOCRParser) from the first <parser> element (DefaultParser).
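
A minimal sketch of the scoping these two assumptions imply, using only the 
JDK XML API and not Tika's own config loader. It reads the config from the 
issue description and lists, for each <parser> element, only the 
<parser-exclude> entries that are its direct children. The class name 
ExcludeScopeSketch and the method excludesPerParser are hypothetical, and 
the sketch says nothing about how Tika 1.12 actually resolves the config.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: models the ASSUMED scoping, not Tika's real behaviour.
public class ExcludeScopeSketch {

    // The configuration from the issue description.
    static final String CONFIG =
          "<properties>"
        + "  <parsers>"
        + "    <parser class=\"org.apache.tika.parser.DefaultParser\">"
        + "      <mime-exclude>application/pdf</mime-exclude>"
        + "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>"
        + "    </parser>"
        + "    <parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
        + "      <mime>application/pdf</mime>"
        + "    </parser>"
        + "  </parsers>"
        + "</properties>";

    // Maps each <parser> class to the classes excluded directly under that element.
    static Map<String, List<String>> excludesPerParser(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            List<String> excludes = new ArrayList<>();
            // Only direct children count, so excludes never leak between <parser> elements.
            for (Node child = parser.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child instanceof Element && "parser-exclude".equals(child.getNodeName())) {
                    excludes.add(((Element) child).getAttribute("class"));
                }
            }
            result.put(parser.getAttribute("class"), excludes);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        excludesPerParser(CONFIG).forEach(
            (parser, excludes) -> System.out.println(parser + " excludes " + excludes));
    }
}
```

Under this reading, the TesseractOCRParser exclude is attached to 
DefaultParser only and the PDFParser entry has an empty exclude list, which 
is why the observed behaviour (no OCR for PDF either) is surprising.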

I'm hoping that whoever wrote the code will read this ticket, clarify the 
behavior, and shed some light on how to enable OCR for PDF only. Thanks!

P.S. OCR works fine without this config when "extractInlineImages true" is 
set in PDFParser.properties and "tesseractPath=/path/to/tesseract.exe" is 
set in TesseractOCRConfig.properties.



was (Author: kavdeev):
The format of the configuration file is not well documented (yes, I agree, 
writing documentation is boring :)), so I'm working from assumptions.
Assumption 1: <mime-exclude> and <parser-exclude> elements under the same 
parent element are independent of each other.
Assumption 2: <parser-exclude> elements are independent of each other too.

So, if the assumptions above are true, I do not understand why the second 
<parser> definition (for PDFParser) uses the setting (exclude 
TesseractOCRParser) from the first <parser> element (DefaultParser).

I'm hoping that whoever wrote the code will read this ticket, clarify the 
behavior, and shed some light on how to enable OCR for PDF only. Thanks!

P.S. OCR works fine without this config when "extractInlineImages true" is 
set in PDFParser.properties and "tesseractPath=/path/to/tesseract.exe" is 
set in TesseractOCRConfig.properties.


> Configuring Parsers: "high degree of control over which parsers are or aren't 
> used" does not work
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1963
>                 URL: https://issues.apache.org/jira/browse/TIKA-1963
>             Project: Tika
>          Issue Type: Bug
>          Components: config
>    Affects Versions: 1.12
>         Environment: windows, java version "1.8.0_73", 64 bit
>            Reporter: Konstantin Avdeev
>
> Hi everybody!
> I'm trying to white-list a particular mime-type for OCR with the following 
> config:
> {code}
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>application/pdf</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>     </parser>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <mime>application/pdf</mime>
>     </parser>
>   </parsers>
> </properties>
> {code}
> So, the idea is to enable the Tesseract parser for the PDF format only.
> But this configuration disables Tesseract completely.
> Is this the expected behaviour or a bug?
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
