David Eric Pugh created TIKA-2970:
-------------------------------------
Summary: Configuring Tesseract for OCR of PDF via Tika Config is
not working
Key: TIKA-2970
URL: https://issues.apache.org/jira/browse/TIKA-2970
Project: Tika
Issue Type: Improvement
Components: ocr
Affects Versions: 1.22
Reporter: David Eric Pugh
Based on TIKA-2705, I thought I could stop using the properties files to
configure PDF and OCR processing and rely only on a tika-config.xml file.
I believe I have a unit test demonstrating that if you need to override the
Tesseract path for OCR, you always end up with the default Tesseract
configuration, which leads to Tika throwing an error:
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
Stepping through the code, it seems that every time we consult the context:
```
TesseractOCRConfig tesseractConfig =
context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
```
We always get back the default. The context never contains our customized
TesseractOCRConfig, even though I can see that loading the TikaConfig in the
first case does create a TesseractOCRParser object WITH the various
parameters.
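
To make the suspected mechanism concrete, here is a minimal, self-contained
sketch of a type-keyed context lookup with a default. It does not use Tika's
classes: Context, OcrConfig, DEFAULT_CONFIG, resolvePath and the example path
are hypothetical stand-ins, and the sketch assumes only that the lookup behaves
like the get(Class, default) call quoted above. It shows that a customized
config held elsewhere (for example, on a parser instance) is invisible to such
a lookup unless something explicitly places it in the context.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {

    /** Hypothetical stand-in for an OCR config; NOT Tika's TesseractOCRConfig. */
    static final class OcrConfig {
        final String tesseractPath;

        OcrConfig(String tesseractPath) {
            this.tesseractPath = tesseractPath;
        }
    }

    /** Simplified stand-in for a type-keyed parse context; NOT Tika's ParseContext. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        /** Returns the registered value for key, or defaultValue if none was set. */
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Default config with an empty path, assumed here purely for illustration. */
    static final OcrConfig DEFAULT_CONFIG = new OcrConfig("");

    /** The path a downstream consumer sees when it consults only the context. */
    static String resolvePath(Context context) {
        return context.get(OcrConfig.class, DEFAULT_CONFIG).tesseractPath;
    }

    public static void main(String[] args) {
        // A customized config exists, but nothing has put it into the context yet.
        OcrConfig custom = new OcrConfig("/opt/tesseract/bin");

        Context empty = new Context();
        System.out.println("empty context:     '" + resolvePath(empty) + "'");

        // Only after an explicit set() does the lookup return the customized config.
        Context populated = new Context();
        populated.set(OcrConfig.class, custom);
        System.out.println("populated context: '" + resolvePath(populated) + "'");
    }
}
```

If the real code path behaves like this sketch, the fix would need either to
copy the parser's configured values into the context before the PDF parser
consults it, or to have the lookup fall back to the configured parser rather
than to a static default.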