[ https://issues.apache.org/jira/browse/TIKA-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-4091:
------------------------------
    Priority: Blocker  (was: Major)

> OLE2 / CFB entry names should be treated case-insensitively
> -----------------------------------------------------------
>
>                 Key: TIKA-4091
>                 URL: https://issues.apache.org/jira/browse/TIKA-4091
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Ross Johnson
>            Priority: Blocker
>         Attachments: protected - normal case.docx, protected - upper case.docx,
>                      simple - lower case.doc, simple - normal case.doc,
>                      simple - upper case.doc
>
>
> According to section [2.6.1 of MS-CFB|https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/60fe8611-66c3-496b-b70d-a504c94c9ace],
> entries (whether they be "storage" or "stream" nodes) should be located with
> a special case-insensitive uppercase mapping. I believe Tika is using a
> case-sensitive approach, e.g. when looking for certain OLE2 objects in
> POIFSContainerDetector.java. The result is that Tika may perform incomplete
> or otherwise subpar type detection on OLE2 files, as well as provide
> incomplete metadata & extracted text output.
>
> Attached are some sample documents. The 3 "simple" ones demonstrate
> incomplete metadata & text extraction. These 3 files are equivalent except
> for the casing of the OLE2 names. Word opens all normally and shows the
> correct metadata. Tika output is missing all metadata and document content
> for the "upper case" and "lower case" variants.
>
> The two "protected" examples are again equivalent, except for the casing.
> Tika gives an EncryptedDocumentException for "protected - normal case.docx"
> but not for "protected - upper case.docx". The password for these 2 files is
> "password".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
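
To illustrate the kind of lookup the description asks for, here is a minimal Java sketch of a case-insensitive entry-name match. It is not Tika or POI code: the class CfbNames and its methods sameEntryName and findEntry are hypothetical names invented for this example, and Character.toUpperCase(int) is used only as an approximation of the uppercase mapping in MS-CFB section 2.6.1, which may differ for some characters.

```java
import java.util.Collection;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class CfbNames {

    private CfbNames() {}

    /**
     * Returns true if the two entry names are equal after uppercasing each
     * code point. Assumption: Java's simple one-to-one uppercase mapping is
     * close enough to the mapping described in MS-CFB 2.6.1.
     */
    static boolean sameEntryName(String a, String b) {
        // A one-to-one mapping never changes the length, so names of
        // different length cannot match.
        if (a.length() != b.length()) {
            return false;
        }
        int i = 0;
        while (i < a.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (Character.toUpperCase(ca) != Character.toUpperCase(cb)) {
                return false;
            }
            i += Character.charCount(ca);
        }
        return true;
    }

    /** Returns the first stored name that matches the wanted name, ignoring case. */
    static Optional<String> findEntry(Collection<String> names, String wanted) {
        return names.stream()
                .filter(n -> sameEntryName(n, wanted))
                .findFirst();
    }
}
```

With a helper along these lines, a detector that looks for a stream such as "WordDocument" or "EncryptedPackage" would also find it when the file stores the name as "WORDDOCUMENT" or "encryptedpackage", which is the situation the attached "upper case" and "lower case" samples exercise.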