[ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-3517:
------------------------------
Attachment: Document, Document.iwa
> Text extraction doesn't work for Pages and Numbers when Tesseract is disabled
> -----------------------------------------------------------------------------
>
> Key: TIKA-3517
> URL: https://issues.apache.org/jira/browse/TIKA-3517
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.0.0
> Environment: I tested this on RHEL7. I got the same results whether
> I was using Tesseract 3 or Tesseract 4, but that doesn't really matter
> because the problems I'm having are when Tesseract is disabled.
> Reporter: Chris Bryant
> Priority: Major
> Attachments: Document, Document.iwa, SSN.numbers, SSN.pages,
> no_ocr.xml
>
>
> When I run Tika to extract text from Mac Pages and Numbers files, text
> extraction does not work if Tesseract is disabled. I'm attaching sample
> files, including the config file I use to disable Tesseract.
> I get the same results whether I run the server version
> (tika-server-standard-2.0.0.jar) or the command line app
> (tika-app-2.0.0.jar).
> The following commands extract the text, along with what appears to be a
> list of the .iwa and .jpg files inside the Pages and Numbers files:
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.numbers
> However, when I run the following commands with the configuration file that
> disables Tesseract, only the list of .iwa and .jpg files is extracted, and
> none of the actual text:
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.numbers
>
> I haven't seen similar problems with the other file types I've tested,
> including .docx, .pptx, .xlsx, .odt, .ods, .odp, and .pdf. Those work fine
> whether or not Tesseract is disabled.
>
> On a somewhat separate issue, I have been unable to get any text extracted
> from my test Keynote file at all, whether Tesseract is enabled or not. I'm
> having difficulty uploading that file, so I'll see if I can add that later.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)