Based on

https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java#L518

and

https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java#L159

I _think_ we're handling this...
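
For anyone following along, here is a rough sketch of the check under discussion: look at the first bytes of the embedded object to decide whether it is raw OOXML (a ZIP) or an OLE2 wrapper whose Package entry would need to be unwrapped first. It uses only the JDK. The class, enum, and method names are made up for illustration and are not Tika or POI API; the two signatures are the standard OLE2 and ZIP magic numbers.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika or POI.
class EmbeddedSniffer {
    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document header signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK" 0x03 0x04); OOXML files start with it.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classify the leading bytes of an embedded object.
    // OLE2 means the caller still has to look inside the container
    // (e.g. for a Package entry) before handing bytes to an OOXML parser.
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] zipHead = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipHead));
    }
}
```

If the result is OLE2, the outer stream is the wrapper described below, and the XLSX bytes would have to be read from its Package entry rather than passed to WorkbookFactory directly.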

On Tue, Oct 13, 2020 at 10:38 AM Tim Allison <[email protected]> wrote:

> Thank you, Nick!
>
> IIUC the XLSX raw bytes are in the Package entry of an OLE2 wrapper.  What
> is the key for the OLE2 wrapper in the PPT?  Sorry for missing this...
>
> Have you put your hands on an example that you could share privately?
> Happy to look through our regression corpus if I know what exactly to look
> for.
>
> Thank you, again!
>
> Cheers,
>
>        Tim
>
> On Sat, Oct 10, 2020 at 7:20 AM Nick Burch <[email protected]> wrote:
>
>> On Fri, 9 Oct 2020, Tim Allison wrote:
>> > Do you think we should follow up on the Tika side?  Do we know if we can
>> > handle this?
>>
>> I thought we did, but checking POIFSContainerDetector I can't actually
>> see that case covered...
>>
>> I think we (Tika) can handle it in a similar way to CompObj
>>
>> > Over on Stack Overflow <https://stackoverflow.com/q/64269294/685641>
>> > there's a user who was getting what they thought was an embedded XLSX
>> > file out of a PPT, but finding it was an OLE2 wrapper with CompObj and
>> > Package entries. The real XLSX was in the Package part. Passing the
>> > outer OLE2 stream to WorkbookFactory didn't work
>>
>> The list of entries to search for is in the comments on the question. We
>> may actually have a similar file in our corpus we can use to test. I
>> think it is triggered when an OOXML file is embedded in a PPT by some
>> older versions of PowerPoint, as a compatibility wrapper
>>
>> Nick
>>
>
