Ha, yes, this file exercises those bits of code:
https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPPT_oleWorkbook.ppt

Nick, does this match the features of the SO question?

On Tue, Oct 13, 2020 at 10:58 AM Tim Allison <[email protected]> wrote:

> Based on
>
>
> https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java#L518
>
> and
>
>
> https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java#L159
>
> I _think_ we're handling this...
>
> On Tue, Oct 13, 2020 at 10:38 AM Tim Allison <[email protected]> wrote:
>
>> Thank you, Nick!
>>
>> IIUC the XLSX raw bytes are in the Package entry of an OLE2 wrapper.
>> What is the key for the OLE2 wrapper in the PPT?  Sorry for missing this...
>>
>> Have you put your hands on an example that you could share privately?
>> Happy to look through our regression corpus if I know what exactly to look
>> for.
>>
>> Thank you, again!
>>
>> Cheers,
>>
>>        Tim
>>
>> On Sat, Oct 10, 2020 at 7:20 AM Nick Burch <[email protected]> wrote:
>>
>>> On Fri, 9 Oct 2020, Tim Allison wrote:
>>> > Do you think we should follow up on the Tika side?  Do we know if we
>>> > can handle this?
>>>
>>> I thought we did, but checking POIFSContainerDetector I can't actually
>>> see that case covered....
>>>
>>> I think we (Tika) can handle it in a similar way to CompObj
>>>
>>> > Over on Stackoverflow <https://stackoverflow.com/q/64269294/685641>
>>> > there's a user who was getting what they thought was an embedded XLSX
>>> > file out of a PPT, but finding it was an OLE2 wrapper with CompObj and
>>> > Package entries. The real XLSX was in the Package part. Passing the
>>> > outer OLE2 stream to WorkbookFactory didn't work
>>>
>>> The list of entries to search for is in the comments on the question.
>>> We may actually have a similar file in our corpus we can use to test. I
>>> think it is triggered when an OOXML file is embedded in a PPT by some
>>> older versions of PowerPoint, as a compatibility wrapper
>>>
>>> Nick
>>>
>>
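
For anyone hitting the same thing as the SO user, here is a rough sketch of
how one might tell the OLE2 wrapper apart from a bare XLSX before deciding
what to do with an embedded stream. It only looks at the leading magic bytes
(the OLE2 compound file signature versus the ZIP local file header that OOXML
files start with). The class and enum names (EmbeddedSniffer, Kind) are made
up for illustration; this is not how Tika or POI do their detection.

```java
// Hypothetical helper, not part of Tika or POI: classifies an embedded
// stream by its first bytes only.
final class EmbeddedSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; OOXML files such as XLSX are ZIPs.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Kind sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    // True if data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If the result is OLE2 for something expected to be an XLSX, that points at
the wrapper case described above: the real file would then be in the Package
entry of the wrapper, and the outer stream is not the workbook itself.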
