[jira] [Created] (TIKA-2970) Configuring Tesseract for OCR of PDF via Tika Config is not working

David Eric Pugh (Jira) Sun, 20 Oct 2019 13:11:46 -0700

David Eric Pugh created TIKA-2970:
-------------------------------------

             Summary: Configuring Tesseract for OCR of PDF via Tika Config is not working
                 Key: TIKA-2970
                 URL: https://issues.apache.org/jira/browse/TIKA-2970
             Project: Tika
          Issue Type: Improvement
          Components: ocr
    Affects Versions: 1.22
            Reporter: David Eric Pugh



Based on TIKA-2705, I thought I could eliminate the properties files used to 
configure PDF and OCR processing and use only a tika-config.xml file.

I believe I have a unit test demonstrating that if you need to override the 
Tesseract path for OCR, you always end up with the default Tesseract 
configuration, which leads to Tika throwing an error: 
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
   

In stepping through the code, it seems like every time we consult the context:

```
TesseractOCRConfig tesseractConfig =
                context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
```
We always get back the default. The context never contains our customized 
TesseractOCRConfig, even though I can see that when we first load the 
TikaConfig we do create a TesseractOCRParser object with the various 
parameters set.
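
To make the suspected behaviour concrete, here is a minimal, self-contained sketch of the lookup pattern. It does not use Tika: `Context`, `OcrConfig`, `DEFAULT_CONFIG`, and `resolvePath` are hypothetical stand-ins for ParseContext, TesseractOCRConfig, DEFAULT_TESSERACT_CONFIG, and the code that consults the context. The assumption is only that a lookup with a fallback returns the fallback when nothing was registered for that class in the context.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {

    // Hypothetical stand-in for TesseractOCRConfig; only the path matters here.
    static final class OcrConfig {
        final String tesseractPath;

        OcrConfig(String tesseractPath) {
            this.tesseractPath = tesseractPath;
        }
    }

    // Hypothetical stand-in for ParseContext: a per-parse map keyed by class.
    static final class Context {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        // Returns the registered value, or defaultValue if nothing was set.
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key.getName());
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // Stand-in for DEFAULT_TESSERACT_CONFIG: no custom path configured.
    static final OcrConfig DEFAULT_CONFIG = new OcrConfig("");

    // Mirrors the lookup quoted above: consult the context, fall back to the default.
    static String resolvePath(Context context) {
        return context.get(OcrConfig.class, DEFAULT_CONFIG).tesseractPath;
    }

    public static void main(String[] args) {
        // Built from configuration, but held only by the parser object.
        OcrConfig customConfig = new OcrConfig("/opt/tesseract/bin");

        Context context = new Context();
        // Nothing was registered in the context, so the default path is used.
        System.out.println("without context entry: [" + resolvePath(context) + "]");

        // Once the custom config is placed in the context, the lookup finds it.
        context.set(OcrConfig.class, customConfig);
        System.out.println("with context entry:    [" + resolvePath(context) + "]");
    }
}
```

If this sketch matches what happens in Tika, the customized configuration exists on the parser but is never visible to code that only reads from the context, which would explain the behaviour described above.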



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
