[ https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka updated TIKA-779:
------------------------------
Attachment: tika-779.patch
My workaround + test.
> Detection of Microsoft Works 2000 Word Processor files
> ------------------------------------------------------
>
> Key: TIKA-779
> URL: https://issues.apache.org/jira/browse/TIKA-779
> Project: Tika
> Issue Type: Test
> Affects Versions: 1.0
> Environment: Windows 7, 64 bit
> Reporter: Antoni Mylka
> Attachments: microsoft-works-word-processor-2000.wps, tika-779.patch
>
>
> In older versions of Tika, our Microsoft Works 2000 Word Processor example
> file was recognized properly by the POIFSContainerDetector. Now it isn't.
> Some debugging revealed that the improvements from TIKA-704 broke the
> detection of that particular file. The detection is based on the top-level
> names obtained from the root DirectoryNode. In the case of this file there
> are two strings in that set: "CONTENTS" and "\u0001CompObj". In older
> versions, "CONTENTS" alone was enough to recognize a file as
> "application/vnd.ms-works". Now the check looks like this:
> {noformat}
> if (names.contains("CONTENTS") && names.contains("SPELLING")) {
>     return WPS;
> } else if (names.contains("CONTENTS")) {
>     // CONTENTS without SPELLING normally means some sort of
>     // embedded non-office file inside an OLE2 document
>     // This is most commonly triggered on nested directories
>     return OLE;
> }
> {noformat}
> I now have a file with CONTENTS but without SPELLING, and it is a normal WPS
> file. I worked around it like this:
> {noformat}
> if (names.contains("CONTENTS") &&
>         (names.contains("SPELLING") || names.contains("\u0001CompObj"))) {
>     return WPS;
> } else if (names.contains("CONTENTS")) {
>     // CONTENTS without SPELLING normally means some sort of
>     // embedded non-office file inside an OLE2 document
>     // This is most commonly triggered on nested directories
>     return OLE;
> }
> {noformat}
> So "CONTENTS" has to be accompanied by either "SPELLING" or "\u0001CompObj".
> I don't know what the second string means, and I don't know whether it also
> occurs in the "embedded non-office files inside an OLE2 document" referred
> to in that comment. The workaround solves the problem for me: the Tika build
> tests pass, and the regression tests of my apps pass as well.
> Jukka, do you have more than one WPS file, and do all of them have both the
> CONTENTS and SPELLING names in that collection? Is the "\u0001CompObj"
> string characteristic of this format, or is it a generic thing that also
> occurs in those "non-office files" or "nested directories"? If it is
> generic, just close this as wontfix.
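
To make the difference between the two checks concrete, here is a minimal, self-contained sketch that applies both rules to a plain set of top-level names. It is not the Tika code: the `Kind` enum, the `WorksDetectionSketch` class and its method names are hypothetical stand-ins, and `UNKNOWN` only represents "fall through to whatever checks come next" in the real detector.

```java
import java.util.Set;

// Hypothetical stand-in for the detector's result; not Tika's real types.
enum Kind { WPS, OLE, UNKNOWN }

final class WorksDetectionSketch {
    // Second top-level name found in the attached example file.
    static final String COMP_OBJ = "\u0001CompObj";

    // The check as quoted in the issue (after TIKA-704).
    static Kind detectCurrent(Set<String> names) {
        if (names.contains("CONTENTS") && names.contains("SPELLING")) {
            return Kind.WPS;
        } else if (names.contains("CONTENTS")) {
            // CONTENTS without SPELLING is treated as generic OLE2
            return Kind.OLE;
        }
        return Kind.UNKNOWN; // assumption: other checks would run here
    }

    // The proposed workaround: SPELLING or \u0001CompObj may accompany CONTENTS.
    static Kind detectWithWorkaround(Set<String> names) {
        if (names.contains("CONTENTS")
                && (names.contains("SPELLING") || names.contains(COMP_OBJ))) {
            return Kind.WPS;
        } else if (names.contains("CONTENTS")) {
            return Kind.OLE;
        }
        return Kind.UNKNOWN; // assumption: other checks would run here
    }
}
```

For the example file's names, `Set.of("CONTENTS", "\u0001CompObj")`, `detectCurrent` yields `OLE` while `detectWithWorkaround` yields `WPS`. A set containing only "CONTENTS" is still reported as `OLE` by both, so the workaround changes the result only when "\u0001CompObj" is present. Whether that name also shows up in embedded non-office files is the open question above.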