Ross Johnson created TIKA-4091:
----------------------------------

             Summary: OLE2 / CFB entry names should be treated 
case-insensitively
                 Key: TIKA-4091
                 URL: https://issues.apache.org/jira/browse/TIKA-4091
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.8.0
            Reporter: Ross Johnson
         Attachments: protected - normal case.docx, protected - upper 
case.docx, simple - lower case.doc, simple - normal case.doc, simple - upper 
case.doc

According to section [2.6.1 of 
MS-CFB|https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/60fe8611-66c3-496b-b70d-a504c94c9ace],
 entries (whether they be "storage" or "stream" nodes) should be located with a 
special case-insensitive uppercase mapping. I believe Tika is using a 
case-sensitive approach, e.g. when looking for certain OLE2 objects in 
POIFSContainerDetector.java. The result is that Tika may perform incomplete or 
otherwise subpar type detection on OLE2 files, as well as provide incomplete 
metadata & extracted text output.
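
To illustrate the difference, here is a minimal, self-contained sketch. It is not Tika's or POI's actual code. The class CfbNameLookup and the helpers toLookupKey and hasEntry are hypothetical names, and uppercasing each UTF-16 code unit with Character.toUpperCase is only an assumed approximation of the uppercase mapping in MS-CFB, which should be checked against the spec before relying on it.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class CfbNameLookup {

    // Hypothetical helper: builds a comparison key by uppercasing each UTF-16
    // code unit on its own. ASSUMPTION: this approximates the MS-CFB uppercase
    // mapping; it is not a verified implementation of the spec.
    static String toLookupKey(String name) {
        StringBuilder key = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            key.append(Character.toUpperCase(name.charAt(i)));
        }
        return key.toString();
    }

    // Hypothetical helper: true if any entry name matches the wanted name
    // once both have been converted to their comparison keys.
    static boolean hasEntry(Collection<String> entryNames, String wanted) {
        String wantedKey = toLookupKey(wanted);
        for (String name : entryNames) {
            if (toLookupKey(name).equals(wantedKey)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Entry names as they might appear in an "upper case" sample file.
        List<String> upperCaseFile = Arrays.asList("WORDDOCUMENT", "1TABLE");

        // Case-sensitive check: misses the entry.
        System.out.println(upperCaseFile.contains("WordDocument"));

        // Case-insensitive check: finds it.
        System.out.println(hasEntry(upperCaseFile, "WordDocument"));
    }
}
```

With a lookup like this, the "upper case" and "lower case" variants would resolve to the same entries as the "normal case" file.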

Attached are some sample documents. The three "simple" ones demonstrate 
incomplete metadata & text extraction. These three files are equivalent except 
for the casing of the OLE2 entry names. Word opens all of them normally and 
shows the correct metadata. Tika output is missing all metadata and document 
content for the "upper case" and "lower case" variants.

The two "protected" examples are again equivalent, except for the casing. Tika 
gives an EncryptedDocumentException for "protected - normal case.docx" but not 
for "protected - upper case.docx". The password for these 2 files is "password".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
