Based on https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java#L518
and https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java#L159
I _think_ we're handling this...

On Tue, Oct 13, 2020 at 10:38 AM Tim Allison <[email protected]> wrote:

> Thank you, Nick!
>
> IIUC the XLSX raw bytes are in the Package entry of an OLE2 wrapper. What
> is the key for the OLE2 wrapper in the PPT? Sorry for missing this...
>
> Have you put your hands on an example that you could share privately?
> Happy to look through our regression corpus if I know what exactly to look
> for.
>
> Thank you, again!
>
> Cheers,
>
> Tim
>
> On Sat, Oct 10, 2020 at 7:20 AM Nick Burch <[email protected]> wrote:
>
>> On Fri, 9 Oct 2020, Tim Allison wrote:
>> > Do you think we should follow up on the Tika side? Do we know if we can
>> > handle this?
>>
>> I thought we did, but checking POIFSContainerDetector I can't actually see
>> that case covered....
>>
>> I think we (Tika) can handle it in a similar way to CompObj
>>
>> > Over on Stackoverflow <https://stackoverflow.com/q/64269294/685641>
>> > there's a user who was getting what they thought was an embedded XLSX file
>> > out of a PPT, but finding it was an OLE2 wrapper with CompObj and Package
>> > entries. The real XLSX was in the Package part. Passing the outer OLE2
>> > stream to WorkbookFactory didn't work
>>
>> The list of entries to search for are in the comments on the question. We
>> may actually have a similar file in our corpus we can use to test. I think
>> it is triggered when an OOXML file is embedded in a PPT by some older
>> versions of PowerPoint, as a compatibility wrapper
>>
>> Nick
>>
>
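
For anyone trying to reproduce the Stack Overflow case, a quick way to tell whether an extracted embedded object is the raw OOXML zip or the OLE2 compatibility wrapper is to look at its first bytes. This is only a rough sketch using the plain JDK. The class and method names (EmbeddedSniffer, sniff) are made up for illustration and are not Tika or POI API. It checks only the well-known OLE2 and ZIP signatures and does not confirm that a Package entry is actually present.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika or POI: classifies an embedded
// object's bytes by their leading magic number.
final class EmbeddedSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature: D0 CF 11 E0 A1 B1 1A E1
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK\3\4"); OOXML files such as XLSX
    // are ZIP packages and normally start with it.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Kind sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return Kind.OLE2;   // possibly a wrapper holding a Package entry
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return Kind.ZIP;    // possibly the XLSX itself
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data != null
            && data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = { 'P', 'K', 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(sample));
    }
}
```

If this reports OLE2 for something expected to be an XLSX, that matches the situation described above: the outer stream is the wrapper, and the real workbook bytes would have to be read from its Package entry before being handed to WorkbookFactory.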
