[
https://issues.apache.org/jira/browse/TIKA-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brian McColgan closed TIKA-2588.
--------------------------------
Issue resolved very quickly and effectively by the maestro Tika developer T.A.
Thank you once again, you rock!
> Tika detecting/parsing pptx with embedded Excel worksheet(s)...
> ---------------------------------------------------------------
>
> Key: TIKA-2588
> URL: https://issues.apache.org/jira/browse/TIKA-2588
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.17
> Environment:
> Reporter: Brian McColgan
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: foo.out, pptEmbedExcelDoubleClickFromWorkbook.PNG,
> pptEmbedExcelInEmptyWorkbook.PNG, tikaSample.pptx
>
>
> Hello tika-developers,
> First, a big 'thank-you' for creating and maintaining Apache-Tika! A really
> useful capability/service that can be used in so many different ways. You
> folks are the true Debabelizer (h2g2.com).
> On to the issue encountered: using Tika 1.17 to extract an embedded Excel
> object out of a pptx is causing issues. A simple example is attached to this
> Jira issue ([^tikaSample.pptx]); running it through Tika 1.17 (with
> verbose/list-parsers/list-detectors) produces the output in [^foo.out].
> The deck contains a title slide and a single slide with an embedded Excel
> object on it.
> As noted to [~gagravarr] on Stack Overflow, I grabbed the unit-test data that
> you use in your parser/office JUnit suite (test_ppt_embedded_two_slides.pptx)
> and tried opening it in Office/PPT 2016. I selected (with the mouse) the
> embedded sheet (it had the Alfresco logo in it) and pasted it into an empty
> Office/Excel 2016 workbook. When I tried to interact with it, I had to
> double-click to make it active. As a result, I ended up with two Excel
> instances on my Windows 10 desktop (the original object in one, the Excel
> worksheet in the other). I have included a picture of the embedded Excel
> object pasted into the workbook:
> !pptEmbedExcelInEmptyWorkbook.PNG!
> followed by the worksheet opened inside the workbook (this required a
> double-click within the black-bordered area in the first picture above):
> !pptEmbedExcelDoubleClickFromWorkbook.PNG!
> I managed to extract the embedded object using Apache POI. The logic
> sequence was something like the following:
> # Create an XMLSlideShow object, and pull the list of underlying slide entities.
> # Walk the list of XSLFSlide(s), searching for a matching slide (by name), e.g. 'MFL'.
> # Examine the PackagePart of the matching XSLFSlide for its content-type.
> # If pPart.content-type is 'application/vnd.openxmlformats-officedocument.oleObject', then 'candidate FOUND'.
> # Build a POIFS around the candidate FOUND, and extract the root of the FileSystem.
> # Verify that the root has entries for \{ 'Package', '\u0001Ole', and '\u0001CompObj' }.
> # Extract entry '\u0001CompObj', and verify that the entry is a DocumentEntry and that the underlying bytes of its DocumentNode match an 'Excel' signature.
> # If step 7 is true, extract entry 'Package'.
> # The resulting entry represents the byte-stream of the embedded Excel entity.
> I was able to instantiate this into a new workbook (as an example) using POI,
> and when I opened it, the worksheet was correctly embedded in that
> 'example.xlsx'.
> I am not as familiar with Tika, so I was a little less comfortable trying to
> walk it through. I thought, however, that recreating this path would provide
> further insight for you.
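
For readers following the extraction sequence in the issue description, here is a minimal, JDK-only sketch of the candidate-finding and signature-check steps. It is not the reporter's POI code and not Tika's implementation. The class and method names (OleCandidateFinder, findOleCandidates, looksLikeExcel) are hypothetical. Two simplifications are assumed: the candidate is detected by sniffing the OLE2 compound-document header rather than by reading the PackagePart content-type, and the 'Excel' signature test is taken to mean that the CompObj bytes contain the ASCII text "Excel".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, for illustration only; not part of POI or Tika.
final class OleCandidateFinder {

    // First 8 bytes of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private OleCandidateFinder() { }

    /**
     * Returns the names of the zip entries inside a pptx whose content
     * starts with the OLE2 header. Assumption: header sniffing stands in
     * for the content-type check described in steps 3 and 4.
     */
    static List<String> findOleCandidates(byte[] pptxBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(pptxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                byte[] head = new byte[OLE2_MAGIC.length];
                int read = 0;
                // read() may return fewer bytes than requested, so loop.
                while (read < head.length) {
                    int n = zip.read(head, read, head.length - read);
                    if (n < 0) {
                        break; // end of this entry
                    }
                    read += n;
                }
                if (read == head.length && Arrays.equals(head, OLE2_MAGIC)) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    /**
     * Assumed reading of step 7: the CompObj stream carries a
     * human-readable type name, so look for the text "Excel" in it.
     */
    static boolean looksLikeExcel(byte[] compObjBytes) {
        String text = new String(compObjBytes, StandardCharsets.ISO_8859_1);
        return text.contains("Excel");
    }
}
```

Each name returned by findOleCandidates identifies a part that could then be opened with POI's POIFSFileSystem to check for the 'Package', '\u0001Ole', and '\u0001CompObj' entries, as in steps 5 and 6; that part is left out here so the sketch has no dependencies beyond the JDK.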
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)