Ross Johnson created TIKA-4091:
----------------------------------
Summary: OLE2 / CFB entry names should be treated
case-insensitively
Key: TIKA-4091
URL: https://issues.apache.org/jira/browse/TIKA-4091
Project: Tika
Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Ross Johnson
Attachments: protected - normal case.docx, protected - upper
case.docx, simple - lower case.doc, simple - normal case.doc, simple - upper
case.doc
According to section [2.6.1 of
MS-CFB|https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/60fe8611-66c3-496b-b70d-a504c94c9ace],
entries (whether "storage" or "stream" nodes) should be located by name using a
case-insensitive comparison based on a specific uppercase mapping. I believe
Tika compares names case-sensitively, e.g. when looking for certain OLE2 objects
in POIFSContainerDetector.java. As a result, Tika may perform incomplete or
otherwise subpar type detection on OLE2 files, and may return incomplete
metadata and extracted text.
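To illustrate the kind of lookup I have in mind, here is a minimal sketch. It is
not Tika or POI code: the class and method names are hypothetical, the entry
names are only illustrative, and Character.toUpperCase is used as an
approximation of the uppercase mapping described in MS-CFB, which lists a few
special cases that this sketch does not reproduce.

```java
import java.util.Collection;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Tika or POI.
class CfbNameLookup {

    // Upper-cases each UTF-16 code unit with Java's simple one-to-one mapping.
    // Assumption: this approximates the MS-CFB mapping; the spec's listed
    // exceptions are not handled here.
    static String toComparisonKey(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            sb.append(Character.toUpperCase(name.charAt(i)));
        }
        return sb.toString();
    }

    // Two entry names match if they are equal after the uppercase mapping.
    static boolean sameEntryName(String a, String b) {
        return a.length() == b.length()
                && toComparisonKey(a).equals(toComparisonKey(b));
    }

    // Returns the stored name that matches 'wanted', whatever its casing.
    static Optional<String> findEntry(Collection<String> entryNames, String wanted) {
        for (String name : entryNames) {
            if (sameEntryName(name, wanted)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative entry names as they might appear in an "upper case" file.
        List<String> entries = List.of("WORDDOCUMENT", "1TABLE");

        // A case-sensitive check misses the stream.
        System.out.println(entries.contains("WordDocument")); // false

        // The case-insensitive lookup finds it.
        System.out.println(findEntry(entries, "WordDocument").isPresent()); // true
    }
}
```

A detector that checks for well-known entry names through a lookup like
findEntry, rather than an exact string match, would treat the differently cased
variants of a file the same way.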
Attached are some sample documents. The 3 "simple" ones demonstrate incomplete
metadata & text extraction. These 3 files are equivalent except for the casing
of the OLE2 names. Word opens all three normally and shows the correct metadata.
Tika's output is missing all metadata and document content for the "upper case"
and "lower case" variants.
The two "protected" examples are again equivalent, except for the casing. Tika
gives an EncryptedDocumentException for "protected - normal case.docx" but not
for "protected - upper case.docx". The password for these 2 files is "password".