[jira] [Comment Edited] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

Tim Allison (Jira) Mon, 09 Aug 2021 13:27:04 -0700


    [ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396260#comment-17396260 ]


Tim Allison edited comment on TIKA-3517 at 8/9/21, 8:26 PM:
------------------------------------------------------------

For posterity, I'm attaching the Document.iwa file from inside the zip and a 
decompressed copy, Document (decompressed with snzip: 
{{snzip -t iwa -d Document.iwa}}).  I'm not able to parse the decompressed 
file with protoc ({{protoc --decode_raw < Document}}) or with the Java 
protobuf library.

{noformat}
        // Try to parse the snzip-decompressed file as one schema-less protobuf message.
        try (InputStream is =
                Files.newInputStream(Paths.get("/home/tallison/Desktop/Document"))) {
            UnknownFieldSet doc =
                    UnknownFieldSet.parseFrom(is);
        }
{noformat}

I get "Protocol message tag had invalid wire type" from the Java library and 
"Failed to parse input" from protoc.
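
For context on the Java error: in the protobuf wire format, each field starts with a tag whose low three bits are the wire type and whose remaining bits are the field number. Only wire types 0 through 5 are defined, so a tag whose low bits are 6 or 7 cannot be decoded. The sketch below only illustrates that rule; the class name {{WireTypeCheck}} and the sample tag values are hypothetical and are not taken from the attached file.

```java
public class WireTypeCheck {

    /** The field number is the tag with the low three bits shifted off. */
    static int fieldNumber(int tag) {
        return tag >>> 3;
    }

    /** The wire type is stored in the low three bits of the tag. */
    static int wireType(int tag) {
        return tag & 0x7;
    }

    /** The protobuf encoding defines wire types 0..5; 6 and 7 are undefined. */
    static boolean isValidWireType(int wireType) {
        return wireType >= 0 && wireType <= 5;
    }

    /** One-line summary of a tag value, for eyeballing suspicious bytes. */
    static String describe(int tag) {
        int wt = wireType(tag);
        return String.format("tag=0x%02x field=%d wireType=%d valid=%b",
                tag, fieldNumber(tag), wt, isValidWireType(wt));
    }

    public static void main(String[] args) {
        // Hypothetical sample tags, not bytes from Document.iwa.
        System.out.println(describe(0x0A)); // field 1, length-delimited
        System.out.println(describe(0x0F)); // field 1, undefined wire type 7
    }
}
```

If the bytes being parsed are not really the start of a single protobuf message, hitting an undefined wire type early on is a plausible way to end up with this exception.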

I also tried decompressing the stream with Commons Compress and got the same 
result.

{noformat}
        Path p = Paths.get("/blah/blah/Document.iwa");
        // Decompress with Commons Compress (iWork dialect), then parse as raw protobuf.
        try (InputStream is = new FramedSnappyCompressorInputStream(
                Files.newInputStream(p),
                FramedSnappyDialect.IWORK_ARCHIVE)) {
            UnknownFieldSet doc =
                    UnknownFieldSet.parseFrom(is);
        }
{noformat}
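
One thing that might be worth checking, and this is an assumption rather than anything confirmed in this comment, is whether the decompressed stream is a sequence of length-prefixed records rather than one bare protobuf message. In that case the first bytes would be a varint length, not a field tag, and parsing the whole stream as a single message would fail. The sketch below reads one protobuf-style base-128 varint from a stream using only the JDK; the class name {{VarintPeek}} and the sample bytes are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class VarintPeek {

    /**
     * Reads one base-128 varint: seven payload bits per byte, least
     * significant group first, high bit set on every byte except the last.
     */
    static long readVarint(InputStream in) throws IOException {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("stream ended inside a varint");
            }
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("varint is longer than 10 bytes");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: 0xAC 0x02 is the varint encoding of 300.
        byte[] sample = {(byte) 0xAC, 0x02};
        try (InputStream is = new ByteArrayInputStream(sample)) {
            System.out.println(readVarint(is));
        }
    }
}
```

Pointing {{readVarint}} at the start of the decompressed Document and comparing the value with the file size would be a cheap way to test the length-prefix guess before digging further.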

There be dragons. :(



> Text extraction doesn't work for Pages and Numbers when Tesseract is disabled
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-3517
>                 URL: https://issues.apache.org/jira/browse/TIKA-3517
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: I tested this on RHEL7.  I got the same results whether 
> I was using Tesseract 3 or Tesseract 4, but that doesn't really matter 
> because the problems I'm having are when Tesseract is disabled.
>            Reporter: Chris Bryant
>            Priority: Major
>         Attachments: Document, Document.iwa, SSN.numbers, SSN.pages, 
> no_ocr.xml
>
>
> When I run Tika to extract text from Mac Pages and Numbers files, the text 
> extraction does not work if Tesseract is disabled.  I'm attaching sample 
> files, including the config file I use to disable Tesseract.  I get the same 
> results whether I run the server version (tika-server-standard-2.0.0.jar) or 
> the command line app (tika-app-2.0.0.jar).
> The following commands extract the text, along with what appears to be a list 
> of the .iwa and .jpg files inside the Pages and Numbers files:
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.numbers
> However, when I run the following commands with the configuration file that 
> disables Tesseract, only the list of .iwa and .jpg files is extracted and 
> none of the actual text:
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.numbers
>  
> I haven't seen similar problems with the other file types I've tested, 
> including .docx, .pptx, .xlsx, .odt, .ods, .odp, and .pdf.  Those work fine 
> whether or not Tesseract is disabled.
>  
> On a somewhat separate issue, I have been unable to get any text extracted 
> from my test Keynote file at all, whether Tesseract is enabled or not.  I'm 
> having difficulty uploading that file, so I'll see if I can add that later.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
