[jira] [Updated] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

Tim Allison (Jira) Mon, 09 Aug 2021 13:17:06 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison updated TIKA-3517:
------------------------------
    Attachment: Document
                Document.iwa

> Text extraction doesn't work for Pages and Numbers when Tesseract is disabled
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-3517
>                 URL: https://issues.apache.org/jira/browse/TIKA-3517
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: I tested this on RHEL7.  I got the same results whether 
> I was using Tesseract 3 or Tesseract 4, but that doesn't really matter 
> because the problems I'm having are when Tesseract is disabled.
>            Reporter: Chris Bryant
>            Priority: Major
>         Attachments: Document, Document.iwa, SSN.numbers, SSN.pages, 
> no_ocr.xml
>
>
> When I run Tika to extract text from Apple Pages and Numbers files, text 
> extraction does not work if Tesseract is disabled.  I'm attaching sample 
> files, including the config file I use to disable Tesseract.  I get the 
> same results whether I run the server version 
> (tika-server-standard-2.0.0.jar) or the command line app 
> (tika-app-2.0.0.jar).  
> The following commands extract text, along with what appears to be a list 
> of the .iwa and .jpg files inside the Pages and Numbers files:
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.numbers
> However, when I run the following commands using the configuration file to 
> disable Tesseract, all that is extracted is the list of .iwa and .jpg files 
> and none of the actual text is extracted:
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.numbers
>  
> I haven't seen similar problems with other types of files I've tested with, 
> including .docx, .pptx, .xlsx, .odt, .ods, .odp, and .pdf.  Those work fine 
> whether or not Tesseract is disabled.
>  
> On a somewhat separate issue, I have been unable to get any text extracted 
> from my test Keynote file at all, whether Tesseract is enabled or not.  I'm 
> having difficulty uploading that file, so I'll see if I can add that later.
>  
>  
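
The no_ocr.xml attachment is not reproduced in this message. As a rough
sketch only, a Tika config file that disables Tesseract typically excludes
the OCR parser from the default parser, along the lines below. This follows
the pattern shown in the Tika configuration documentation and is an
assumption about what the reporter's file contains, not a copy of it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed contents: the actual no_ocr.xml attached to TIKA-3517 may differ. -->
<properties>
  <parsers>
    <!-- Keep all default parsers except the Tesseract OCR parser. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

A file like this would be passed with --config=no_ocr.xml, as in the
commands quoted above.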



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
