Eric Pugh created TIKA-2106:
-------------------------------

             Summary: "hocr" case on Linux fails, but works on OSX.  Related to 
TIKA-2093
                 Key: TIKA-2106
                 URL: https://issues.apache.org/jira/browse/TIKA-2106
             Project: Tika
          Issue Type: Bug
          Components: ocr
         Environment: Bug in Linux, but fine in OSX.
            Reporter: Eric Pugh


We pass an output type, either TXT or HOCR, to the Tesseract command line.  When 
we call the command line we lowercase it to "txt" or "hocr".  However, when we 
read the output back in, we don't lowercase it.  On OSX the constructed file 
path "output.HOCR" is still found, presumably because the default filesystem 
there is case-insensitive, but on Linux it is not.  This patch lowercases HOCR 
to hocr and TXT to txt in the constructed file path.
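
To illustrate the idea, here is a minimal sketch, not the actual patch.  The 
class, enum, and method names are hypothetical stand-ins rather than Tika's real 
API.  The point is only that the extension used to read the file back is derived 
the same way as the one handed to Tesseract.

```java
import java.io.File;
import java.util.Locale;

public class OcrOutputPath {
    // Hypothetical stand-in for the output type setting; not Tika's actual class.
    enum OutputType { TXT, HOCR }

    // Extension as it is passed to the Tesseract command line (lowercased).
    // Locale.ROOT avoids locale-specific case mappings.
    static String extensionFor(OutputType type) {
        return type.name().toLowerCase(Locale.ROOT);
    }

    // Build the path to read back using the same lowercased extension,
    // so the lookup also succeeds on a case-sensitive filesystem.
    static File outputFile(File base, OutputType type) {
        return new File(base.getPath() + "." + extensionFor(type));
    }

    public static void main(String[] args) {
        File base = new File("output");
        System.out.println(outputFile(base, OutputType.HOCR).getName()); // output.hocr
        System.out.println(outputFile(base, OutputType.TXT).getName());  // output.txt
    }
}
```

Deriving both the command-line argument and the file path from one helper keeps 
the two from drifting apart again.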

I didn't write a unit test as I don't have a good Linux environment to test it 
in, but I was able to put a patched version of the Tika Parser jar into my 
Docker build to confirm that it works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
