[ 
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal Essiembre updated TIKA-1946:
-----------------------------------
    Attachment: TIKA-1946-pascal.essiembre-01.patch

I created a patch that now throws a TikaException whenever any of several 
validation checks fails.

Also, thanks to the batch of WP files you shared and your latest spreadsheet 
with versions identified, I was able to compare the first 32 bytes of files 
from several different versions.  They all start with FF 57 50 43, but the 
rest varies, except for version 5.x files, which all share the same signature 
at bytes 17 to 22: FB FF 05 00 32 00.  So I also added code to check for this 
signature and throw an exception when it is found, stating that versions 
older than 6.0 are not yet supported.  This should be tested against your 
corpus to validate that my findings are correct.
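
For illustration only, the two byte checks described above could be sketched 
in Java roughly as follows.  This is not the code from the attached patch.  
The class and method names are hypothetical, standard JDK exceptions stand in 
for TikaException so the sketch has no dependencies, and it assumes "byte 17" 
is counted from 1 (zero-based offset 16), which should be confirmed against 
real files.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika or of the attached patch.
final class WpSignatureSketch {

    // Magic bytes reported at the start of every sampled WordPerfect file.
    static final byte[] WP_MAGIC = {(byte) 0xFF, 0x57, 0x50, 0x43};

    // Signature reported for version 5.x files.
    static final byte[] WP5_SIGNATURE =
            {(byte) 0xFB, (byte) 0xFF, 0x05, 0x00, 0x32, 0x00};

    // Assumption: "byte 17 to 22" is 1-based, so the zero-based offset is 16.
    static final int WP5_SIGNATURE_OFFSET = 16;

    // True if 'expected' appears in 'header' starting at 'offset'.
    static boolean matchesAt(byte[] header, int offset, byte[] expected) {
        if (offset < 0 || header.length < offset + expected.length) {
            return false;
        }
        byte[] slice = Arrays.copyOfRange(header, offset, offset + expected.length);
        return Arrays.equals(slice, expected);
    }

    static boolean isWordPerfect(byte[] header) {
        return matchesAt(header, 0, WP_MAGIC);
    }

    static boolean isVersion5(byte[] header) {
        return isWordPerfect(header)
                && matchesAt(header, WP5_SIGNATURE_OFFSET, WP5_SIGNATURE);
    }

    // Rejects input that fails either check; a real parser would likely
    // throw TikaException here instead.
    static void checkSupported(byte[] header) {
        if (!isWordPerfect(header)) {
            throw new IllegalArgumentException("Not a WordPerfect file");
        }
        if (isVersion5(header)) {
            throw new UnsupportedOperationException(
                    "WordPerfect versions older than 6.0 are not yet supported");
        }
    }
}
```

Passing the first 32 bytes of a file to checkSupported() is enough for both 
checks, since the 5.x signature ends at byte 22.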

Feel free to accept/reject or modify as you wish.

> Add mime detection and parser for WordPerfect
> ---------------------------------------------
>
>                 Key: TIKA-1946
>                 URL: https://issues.apache.org/jira/browse/TIKA-1946
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Nick C
>             Fix For: 2.0, 1.15
>
>         Attachments: TIKA-1946-pascal.essiembre-01.patch, 
> wordperfect_mimes_fuller.zip
>
>
> I noticed some code on GitHub for parsing WordPerfect files 
> (https://github.com/Norconex/importer). It also looks like the author, 
> [~pascal.essiembre], has contributed to Tika before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
