Ha, yes, this file exercises those bits of code: https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPPT_oleWorkbook.ppt
Nick, does this match the features of the SO question?

On Tue, Oct 13, 2020 at 10:58 AM Tim Allison <[email protected]> wrote:

> Based on
>
> https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java#L518
>
> and
>
> https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java#L159
>
> I _think_ we're handling this...
>
> On Tue, Oct 13, 2020 at 10:38 AM Tim Allison <[email protected]> wrote:
>
>> Thank you, Nick!
>>
>> IIUC the XLSX raw bytes are in the Package entry of an OLE2 wrapper.
>> What is the key for the OLE2 wrapper in the PPT? Sorry for missing this...
>>
>> Have you put your hands on an example that you could share privately?
>> Happy to look through our regression corpus if I know what exactly to look
>> for.
>>
>> Thank you, again!
>>
>> Cheers,
>>
>> Tim
>>
>> On Sat, Oct 10, 2020 at 7:20 AM Nick Burch <[email protected]> wrote:
>>
>>> On Fri, 9 Oct 2020, Tim Allison wrote:
>>> > Do you think we should follow up on the Tika side? Do we know if we can
>>> > handle this?
>>>
>>> I thought we did, but checking POIFSContainerDetector I can't actually see
>>> that case covered....
>>>
>>> I think we (Tika) can handle it in a similar way to CompObj
>>>
>>> > Over on Stackoverflow <https://stackoverflow.com/q/64269294/685641>
>>> > there's a user who was getting what they thought was an embedded XLSX file
>>> > out of a PPT, but finding it was an OLE2 wrapper with CompObj and Package
>>> > entries. The real XLSX was in the Package part. Passing the outer OLE2
>>> > stream to WorkbookFactory didn't work
>>>
>>> The list of entries to search for are in the comments on the question. We
>>> may actually have a similar file in our corpus we can use to test. I think
>>> it is triggered when an OOXML file is embedded in a PPT by some older
>>> versions of PowerPoint, as a compatibility wrapper
>>>
>>> Nick
>>>
>>
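
For anyone following along from the SO question, here is a rough sketch of how one might tell the two cases apart before handing an embedded stream to a workbook parser. It only looks at the leading bytes: an OLE2 compound file starts with the signature D0 CF 11 E0 A1 B1 1A E1, while a bare OOXML file is a ZIP and normally starts with "PK" 03 04. The class name EmbeddedSniffer, the Kind enum and the sniff method are made up for illustration; this is not how Tika or POI implement their detection. If the result is OLE2, the real XLSX would still have to be read out of the Package entry (for example with POI's POIFSFileSystem) rather than passing the outer stream along.

```java
import java.util.Arrays;

public class EmbeddedSniffer {

    /** Hypothetical classification of an embedded stream by its leading bytes. */
    public enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; OOXML files (XLSX etc.) are ZIP packages.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Returns true if data begins with the given prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Classifies the first bytes of an embedded object. */
    public static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;   // wrapper: look for a "Package" entry inside
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP;    // likely a bare OOXML package
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] wrapped = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0x00
        };
        byte[] bare = { 'P', 'K', 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(wrapped));
        System.out.println(sniff(bare));
    }
}
```

The check is deliberately cheap: it needs only the first eight bytes, so it can run on a buffered stream before deciding whether to unwrap.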
