[ 
https://issues.apache.org/jira/browse/TIKA-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4091:
------------------------------
    Priority: Blocker  (was: Major)

> OLE2 / CFB entry names should be treated case-insensitively
> -----------------------------------------------------------
>
>                 Key: TIKA-4091
>                 URL: https://issues.apache.org/jira/browse/TIKA-4091
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Ross Johnson
>            Priority: Blocker
>         Attachments: protected - normal case.docx, protected - upper 
> case.docx, simple - lower case.doc, simple - normal case.doc, simple - upper 
> case.doc
>
>
> According to section [2.6.1 of 
> MS-CFB|https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/60fe8611-66c3-496b-b70d-a504c94c9ace],
>  entries (whether they be "storage" or "stream" nodes) should be located with 
> a special case-insensitive uppercase mapping. I believe Tika is using a 
> case-sensitive approach, e.g. when looking for certain OLE2 objects in 
> POIFSContainerDetector.java. The result is that Tika may perform incomplete 
> or otherwise subpar type detection on OLE2 files, as well as provide 
> incomplete metadata & extracted text output.
> Attached are some sample documents. The three "simple" ones demonstrate 
> incomplete metadata & text extraction. These three files are equivalent 
> except for the casing of the OLE2 names. Word opens all three normally and 
> shows the correct metadata. Tika output is missing all metadata and document 
> content for the "upper case" and "lower case" variants.
> The two "protected" examples are again equivalent, except for the casing. 
> Tika gives an EncryptedDocumentException for "protected - normal case.docx" 
> but not for "protected - upper case.docx". The password for these two files 
> is "password".
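
For illustration only, here is a minimal Java sketch of the kind of case-insensitive entry lookup the description asks for. It is not Tika's or POI's actual code: the class CfbNames and its methods are hypothetical names, and the per-code-unit upper-casing via Character.toUpperCase is an assumed approximation of the MS-CFB uppercase mapping, not a verified implementation of it.

```java
import java.util.Collection;
import java.util.Optional;

// Hypothetical helper, not part of Tika or POI.
final class CfbNames {
    private CfbNames() {}

    // Upper-case each UTF-16 code unit on its own. This is a simple,
    // locale-independent mapping that never changes the string length.
    // Assumption: this approximates the mapping described in MS-CFB.
    static String normalize(String name) {
        char[] chars = name.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toUpperCase(chars[i]);
        }
        return new String(chars);
    }

    // Two entry names match if they are equal after normalization.
    static boolean sameEntry(String a, String b) {
        return a.length() == b.length() && normalize(a).equals(normalize(b));
    }

    // Return the stored name that matches the wanted name, if any.
    static Optional<String> find(Collection<String> entryNames, String wanted) {
        for (String candidate : entryNames) {
            if (sameEntry(candidate, wanted)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

With a lookup like this, a detector that searches for "WordDocument" or "EncryptedPackage" would also match "WORDDOCUMENT" or "encryptedpackage", which is the casing difference between the attached sample files.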



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
