[
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pascal Essiembre updated TIKA-1946:
-----------------------------------
Attachment: TIKA-1946-pascal.essiembre-01.patch
I created a patch that now throws a TikaException whenever any of several
validation checks fails.
Also, thanks to the batch of WP files you shared and your latest spreadsheet
with versions identified, I was able to compare the first 32 bytes of files
from different versions. They all start with FF 57 50 43, but the rest
varies, except for version 5.x files, which all share the same signature at
bytes 17 to 22: FB FF 05 00 32 00. So I also added code to check for this
signature and throw an exception when it is encountered, stating that
versions older than 6.0 are not yet supported. This should be tested against
your corpus to validate that my findings are correct.
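To make the two checks concrete, here is a minimal Java sketch of the idea. It is not the code from the attached patch. The class name, method names, and the exception type are hypothetical stand-ins (the patch itself throws TikaException), and the sketch assumes "bytes 17 to 22" is counted from 1, which gives a zero-based offset of 16.

```java
import java.util.Arrays;

/** Hypothetical sketch of the header checks; not the code from the attached patch. */
final class WpSignatureSketch {

    /** Stand-in for TikaException so the sketch compiles without Tika on the classpath. */
    static final class UnsupportedWordPerfectException extends Exception {
        UnsupportedWordPerfectException(String message) {
            super(message);
        }
    }

    // Magic bytes observed at the start of every sampled file: FF 57 50 43.
    static final byte[] WP_MAGIC = {(byte) 0xFF, 0x57, 0x50, 0x43};

    // Marker observed only in the 5.x samples: FB FF 05 00 32 00.
    static final byte[] WP5_MARKER = {(byte) 0xFB, (byte) 0xFF, 0x05, 0x00, 0x32, 0x00};

    // Assumption: "bytes 17 to 22" is 1-based, so the zero-based offset is 16.
    static final int WP5_MARKER_OFFSET = 16;

    /** Returns true if data contains expected starting at the given zero-based offset. */
    static boolean matchesAt(byte[] data, int offset, byte[] expected) {
        if (data.length < offset + expected.length) {
            return false; // header too short to contain the signature
        }
        byte[] window = Arrays.copyOfRange(data, offset, offset + expected.length);
        return Arrays.equals(window, expected);
    }

    static boolean hasWordPerfectMagic(byte[] header) {
        return matchesAt(header, 0, WP_MAGIC);
    }

    static boolean looksLikeVersion5(byte[] header) {
        return hasWordPerfectMagic(header)
                && matchesAt(header, WP5_MARKER_OFFSET, WP5_MARKER);
    }

    /** Throws if the header is not WordPerfect or carries the 5.x marker. */
    static void checkSupported(byte[] header) throws UnsupportedWordPerfectException {
        if (!hasWordPerfectMagic(header)) {
            throw new UnsupportedWordPerfectException(
                    "Not a WordPerfect file: magic bytes FF 57 50 43 not found.");
        }
        if (looksLikeVersion5(header)) {
            throw new UnsupportedWordPerfectException(
                    "WordPerfect versions older than 6.0 are not yet supported.");
        }
    }
}
```

A caller would read the first 32 bytes of the stream and pass them to checkSupported before handing the document to the parser.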
Feel free to accept/reject or modify as you wish.
> Add mime detection and parser for WordPerfect
> ---------------------------------------------
>
> Key: TIKA-1946
> URL: https://issues.apache.org/jira/browse/TIKA-1946
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Reporter: Nick C
> Fix For: 2.0, 1.15
>
> Attachments: TIKA-1946-pascal.essiembre-01.patch,
> wordperfect_mimes_fuller.zip
>
>
> I noticed some code on GitHub for parsing WordPerfect files
> (https://github.com/Norconex/importer). It also looks like the author,
> [~pascal.essiembre], has contributed to Tika before.