Eric Pugh created TIKA-2106:
-------------------------------
Summary: "hocr" case on Linux fails, but works on OSX. Related to
TIKA-2093
Key: TIKA-2106
URL: https://issues.apache.org/jira/browse/TIKA-2106
Project: Tika
Issue Type: Bug
Components: ocr
Environment: Bug in Linux, but fine in OSX.
Reporter: Eric Pugh
We pass an output type, either TXT or HOCR, to the Tesseract command line. When
we call the command line we lowercase it to "txt" or "hocr". However, when we
read the output back in, we don't lowercase it. On OSX the constructed file
path "output.HOCR" is still found (presumably because the default OSX
filesystem is case-insensitive), but on Linux it is not. This patch
lowercases HOCR to hocr and TXT to txt in the constructed file path.
I didn't write a unit test as I don't have a good Linux environment to test it
in, but I was able to put a patched version of the Tika parser jar into my
Docker build to confirm that it works.
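
A minimal sketch of the idea behind the fix, not the actual patch: the class
OcrOutputPath, the enum OutputType, and both helper methods are hypothetical
names used only for illustration. The point is that the extension used to
build the output file path goes through the same lowercasing as the value
passed to the Tesseract command line, so the lookup also succeeds on a
case-sensitive filesystem.

```java
import java.io.File;
import java.util.Locale;

public class OcrOutputPath {
    /** Hypothetical stand-in for the output type setting (TXT or HOCR). */
    enum OutputType { TXT, HOCR }

    /** Single source of truth for the extension: used for the command line and for reading back. */
    static String extension(OutputType type) {
        // Locale.ROOT keeps the case mapping independent of the JVM's default locale.
        return type.name().toLowerCase(Locale.ROOT);
    }

    /** Builds the path of the file expected to exist for a given output base. */
    static File outputFile(File outputBase, OutputType type) {
        return new File(outputBase.getAbsolutePath() + "." + extension(type));
    }

    public static void main(String[] args) {
        File base = new File("output");
        System.out.println(outputFile(base, OutputType.HOCR).getName()); // output.hocr
        System.out.println(outputFile(base, OutputType.TXT).getName());  // output.txt
    }
}
```

Deriving the extension in one place means the name requested on the command
line and the name looked up afterwards cannot drift apart in case.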
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)