Ruairidh Williamson created TIKA-4412:
-----------------------------------------

             Summary: Tika cannot parse XPS files that are split into "pieces"
                 Key: TIKA-4412
                 URL: https://issues.apache.org/jira/browse/TIKA-4412
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.1.0
            Reporter: Ruairidh Williamson
         Attachments: pieces.xps


There is a nuance in the XPS spec where "parts" in the container can be split into "pieces". If a document has been split into pieces, this breaks Tika's parsing, because Tika expects entries at their well-known locations and does not check for the piece files.

The naming scheme for the pieces is pretty simple. Instead of the file being `[Content_Types].xml`, it will be split into `[Content_Types].xml/[0].piece`, `[Content_Types].xml/[1].piece`, `[Content_Types].xml/[2].piece`, etc.

Here is an example document: [^pieces.xps]

I would like to attempt to provide a pull request to fix this, but it may take me a while depending on other factors.

From my understanding, pieces are unique to XPS files and aren't allowed for other office formats. I've taken a look at the POI code that reads the zip package, and wrapping/overriding getEntry to check for these piece files seems a good solution. But I don't want to affect non-XPS files. Any suggestions are appreciated. :)
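
For discussion, the lookup described above could be sketched roughly as follows. This is a minimal illustration that uses only java.util.zip; it is not POI's or Tika's actual code, and the class name PiecedPartReader and the method readPart are hypothetical. It assumes pieces are named `<part>/[n].piece` as in the report, and it additionally accepts a `[n].last.piece` form for the final piece, which is an assumption based on the OPC naming rules and not something the report states.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; not part of Tika or POI.
class PiecedPartReader {

    // Matches a trailing "/[n].piece". The optional ".last" marker for the
    // final piece is an assumption about the OPC naming rules.
    private static final Pattern PIECE_SUFFIX =
            Pattern.compile("/\\[(\\d{1,9})\\](\\.last)?\\.piece$", Pattern.CASE_INSENSITIVE);

    /** Returns the bytes of partName, joined from its pieces if needed, or null if absent. */
    static byte[] readPart(ZipFile zip, String partName) throws IOException {
        // Prefer the entry at its well-known location.
        ZipEntry whole = zip.getEntry(partName);
        if (whole != null && !whole.isDirectory()) {
            return readAll(zip, whole);
        }
        // Otherwise collect "<partName>/[n].piece" entries, sorted by n.
        TreeMap<Integer, ZipEntry> pieces = new TreeMap<>();
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            ZipEntry entry = entries.nextElement();
            String name = entry.getName();
            Matcher m = PIECE_SUFFIX.matcher(name);
            if (m.find() && name.substring(0, m.start()).equalsIgnoreCase(partName)) {
                pieces.put(Integer.parseInt(m.group(1)), entry);
            }
        }
        if (pieces.isEmpty()) {
            return null;
        }
        // Concatenate the pieces in numeric order.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (ZipEntry piece : pieces.values()) {
            out.write(readAll(zip, piece));
        }
        return out.toByteArray();
    }

    private static byte[] readAll(ZipFile zip, ZipEntry entry) throws IOException {
        try (InputStream in = zip.getInputStream(entry)) {
            return in.readAllBytes();
        }
    }
}
```

Because the sketch falls back to pieces only when the plain entry is missing, packages that are not split take the same path as before. Restricting the fallback to XPS files would still need a separate check, which is not shown here.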