Chris Bryant created TIKA-3517:
----------------------------------
Summary: Text extraction doesn't work for Pages and Numbers when
Tesseract is disabled
Key: TIKA-3517
URL: https://issues.apache.org/jira/browse/TIKA-3517
Project: Tika
Issue Type: Bug
Affects Versions: 2.0.0
Environment: I tested this on RHEL7. I got the same results whether I
was using Tesseract 3 or Tesseract 4, but that doesn't really matter because
the problems I'm having are when Tesseract is disabled.
Reporter: Chris Bryant
Attachments: SSN.numbers, SSN.pages, no_ocr.xml
When I run Tika to extract text from Mac Pages and Numbers files, the text
extraction does not work if Tesseract is disabled. I'm attaching sample files,
along with the config file I use to disable Tesseract.
I get the same results whether I run the server version
(tika-server-standard-2.0.0.jar) or the command line app (tika-app-2.0.0.jar).
The following commands extract the text, along with what appears to be a list
of the .iwa and .jpg files inside the Pages and Numbers files:
java -jar ~/tika-app-2.0.0.jar -t ~/SSN.pages
java -jar ~/tika-app-2.0.0.jar -t ~/SSN.numbers
However, when I run the same commands with the configuration file that
disables Tesseract, only the list of .iwa and .jpg files is extracted and none
of the actual text:
java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.pages
java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.numbers
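For reference, no_ocr.xml is attached rather than inlined in this message. A
Tika config that disables Tesseract usually takes roughly the following form.
This is only a sketch, assuming the attachment follows the common
parser-exclude pattern; the actual attached file may differ:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a no-OCR config; an assumption, not a copy of the attached file. -->
<properties>
  <parsers>
    <!-- Load all default parsers except the Tesseract OCR parser. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

With a config of this shape, only the OCR parser should be excluded, so the
text stored in the .iwa parts would still be expected in the output.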
I haven't seen similar problems with the other file types I've tested,
including .docx, .pptx, .xlsx, .odt, .ods, .odp, and .pdf. Those work fine
whether or not Tesseract is disabled.
On a somewhat separate issue, I have been unable to get any text extracted from
my test Keynote file at all, whether Tesseract is enabled or not. I'm having
difficulty uploading that file, so I'll see if I can add that later.