Ruairidh Williamson created TIKA-4412:
-----------------------------------------

             Summary: Tika cannot parse XPS files that are split into "pieces"
                 Key: TIKA-4412
                 URL: https://issues.apache.org/jira/browse/TIKA-4412
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.1.0
            Reporter: Ruairidh Williamson
         Attachments: pieces.xps

The XPS spec allows "parts" in the container to be split into "pieces". If a 
document has been split into pieces, Tika's parsing breaks, because Tika 
expects entries at their well-known locations and doesn't check for the 
piece files.

The naming scheme for the pieces is pretty simple. Instead of the file being 
`[Content_Types].xml`, it is split into `[Content_Types].xml/[0].piece`, 
`[Content_Types].xml/[1].piece`, `[Content_Types].xml/[2].piece`, etc.
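
As a rough illustration of that naming scheme, here is a minimal sketch in 
plain Java (java.util.zip only, not actual Tika or POI code) of how the piece 
entries for a part could be found and stitched back together. The class and 
method names (`PieceReader`, `pieceNames`, `open`) are hypothetical. It also 
accepts a `.last` marker on the final piece (`[n].last.piece`), which is my 
reading of the OPC piece naming rules and should be checked against the spec.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class PieceReader {
    // Matches "/[<n>].piece" and, as an assumption, "/[<n>].last.piece".
    private static final Pattern PIECE_SUFFIX =
            Pattern.compile("/\\[(\\d+)\\](\\.last)?\\.piece");

    /** Returns the piece entry names for partName, ordered by piece number. */
    static List<String> pieceNames(Iterable<String> entryNames, String partName) {
        TreeMap<Integer, String> ordered = new TreeMap<>();
        for (String name : entryNames) {
            if (!name.startsWith(partName)) {
                continue;
            }
            Matcher m = PIECE_SUFFIX.matcher(name.substring(partName.length()));
            if (m.matches()) {
                // Sort numerically so that [10] comes after [9], not after [1].
                ordered.put(Integer.parseInt(m.group(1)), name);
            }
        }
        return new ArrayList<>(ordered.values());
    }

    /** Opens partName directly if present, otherwise concatenates its pieces. */
    static InputStream open(ZipFile zip, String partName) throws IOException {
        ZipEntry whole = zip.getEntry(partName);
        // getEntry may also match a directory entry "partName/", so skip those.
        if (whole != null && !whole.isDirectory()) {
            return zip.getInputStream(whole);
        }
        List<String> names = new ArrayList<>();
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            names.add(entries.nextElement().getName());
        }
        List<InputStream> streams = new ArrayList<>();
        for (String piece : pieceNames(names, partName)) {
            streams.add(zip.getInputStream(zip.getEntry(piece)));
        }
        if (streams.isEmpty()) {
            return null; // neither the whole part nor any pieces were found
        }
        return new SequenceInputStream(Collections.enumeration(streams));
    }
}
```

With something like this, `PieceReader.open(zip, "[Content_Types].xml")` would 
return the same bytes whether the part is stored whole or in pieces.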

Here is an example document [^pieces.xps]

I would like to attempt to provide a pull request to fix this but it may take 
me a while depending on other factors.

From my understanding, pieces are unique to XPS files and aren't allowed for 
other office formats. I've taken a look at the POI code that reads the zip 
package, and wrapping/overriding getEntry to check for these piece files seems 
like a good solution, but I don't want to affect non-XPS files. Any 
suggestions are appreciated. :)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
