[ 
https://issues.apache.org/jira/browse/TIKA-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013717#comment-18013717
 ] 

Tim Allison commented on TIKA-4464:
-----------------------------------

Thank you for opening this. The comment was a placeholder for "there be 
dragons". I didn't have the resources at the time to carry out the work, and 
I'm not sure I have the time available now to do it.

I think commons-compress added a variant of Snappy stream processing just for 
us and our potential handling of iWork files.

If you're able to open a PR or if another dev has time to look into this, I 
think that's going to be the only way forward.

If there's some other way to detect the file type based on XML/JSON or even 
package entry names, basically anything easier than dealing with 
Snappy-compressed protobufs, that would be preferable, but I realize that just 
might not be possible.
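
To make the entry-name idea concrete, here is a minimal sketch that uses only 
java.util.zip. Everything in it is an assumption for illustration: the class 
and method names are hypothetical and not part of Tika, the keynote MIME type 
is patterned on the "unknown" type reported in the issue, and the two prefixes 
are unverified examples. In particular, I don't know of any entry-name prefix 
that reliably separates Pages from Numbers, so the rule table would have to be 
filled in after inspecting real sample files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of Tika.
final class IWorkEntryNameSketch {

    // Fallback type, as reported in TIKA-4464.
    static final String UNKNOWN = "application/vnd.apple.unknown.13";

    // ASSUMPTION: example rules only. Neither the prefixes nor the keynote
    // MIME type have been verified against real files or the Tika sources.
    static final Map<String, String> PREFIX_TO_TYPE = new LinkedHashMap<>();
    static {
        PREFIX_TO_TYPE.put("Index/Slide", "application/vnd.apple.keynote.13");
        PREFIX_TO_TYPE.put("Index/MasterSlide", "application/vnd.apple.keynote.13");
    }

    // Returns the type of the first entry name that matches a rule,
    // or UNKNOWN if no entry name matches.
    static String detectFromNames(Iterable<String> entryNames) {
        for (String name : entryNames) {
            for (Map.Entry<String, String> rule : PREFIX_TO_TYPE.entrySet()) {
                if (name.startsWith(rule.getKey())) {
                    return rule.getValue();
                }
            }
        }
        return UNKNOWN;
    }

    // Collects the entry names of a zip package and applies the rules.
    static String detect(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return detectFromNames(names);
    }
}
```

Calling IWorkEntryNameSketch.detect(...) on a package stream falls back to the 
unknown type whenever no rule matches, which is the current behaviour for 
.pages and .numbers files, so rules like these could only ever narrow the 
unknown case and not replace the protobuf check.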

 

> Parsing IWork files results in unknown mimetype
> -----------------------------------------------
>
>                 Key: TIKA-4464
>                 URL: https://issues.apache.org/jira/browse/TIKA-4464
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 3.2.1
>            Reporter: Gregor Lang
>            Priority: Minor
>         Attachments: sample-2.pages, sample.key, sample.numbers, sample.pages
>
>
> When parsing *.pages or *.numbers files, the resulting MIME type is always 
> "application/vnd.apple.unknown.13".
>
> There seems to be a TODO in *IWork13PackageParser* at line 319, which is 
> probably related.
> {code:java}
> // Is it the main document?
> if (name.equals(IWORK13_MAIN_ENTRY)) {
>     // TODO Decode the snappy stream, and check for the Message Type
>     // =     2 (TN::SheetArchive), it is a numbers file;
>     // = 10000 (TP::DocumentArchive), that's a pages file
>     return null;
> }
> {code}
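
The last step of the TODO quoted above, mapping a decoded message type to a 
document type, could look roughly like the sketch below. It assumes the Snappy 
stream has already been decompressed and that the caller knows the offset of a 
protobuf varint holding the message type; how to locate that field in the 
decoded data is not covered here. The class and method names are hypothetical, 
and the pages and numbers MIME types are assumptions patterned on the 
"unknown" type from the issue. Only the values 2 and 10000 come from the TODO.

```java
// Hypothetical helper for illustration only; not part of Tika.
final class IwaMessageTypeSketch {

    // Fallback type, as reported in TIKA-4464.
    static final String UNKNOWN = "application/vnd.apple.unknown.13";
    // ASSUMPTION: names patterned on the "unknown" type above.
    static final String NUMBERS = "application/vnd.apple.numbers.13";
    static final String PAGES = "application/vnd.apple.pages.13";

    // Decodes one protobuf base-128 varint starting at offset:
    // 7 payload bits per byte, least significant group first,
    // high bit set on every byte except the last.
    static long readVarint(byte[] buf, int offset) {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (offset >= buf.length) {
                throw new IllegalArgumentException("truncated varint");
            }
            int b = buf[offset++] & 0xFF;
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IllegalArgumentException("varint longer than 10 bytes");
    }

    // Message type values taken from the TODO in IWork13PackageParser.
    static String mimeTypeFor(long messageType) {
        if (messageType == 2L) {
            return NUMBERS; // TN::SheetArchive
        }
        if (messageType == 10000L) {
            return PAGES; // TP::DocumentArchive
        }
        return UNKNOWN;
    }
}
```

For example, the two bytes 0x90 0x4E decode to 10000, so 
mimeTypeFor(readVarint(bytes, 0)) would report the pages type for them.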



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
