[jira] [Commented] (TIKA-3526) i cant extract content from attachments in the document

Tim Allison (Jira) Tue, 17 Aug 2021 03:23:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400310#comment-17400310
 ]


Tim Allison commented on TIKA-3526:
-----------------------------------

For example, if I read your table correctly, it looks like Tika is not 
extracting an xlsx from a pptx.  However, when I embed an xlsx into a pptx as 
attached, I get this from tika-app:

{noformat}
<?xml version="1.0" encoding="UTF-8"?><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2021-08-17T10:20:03Z"/>
<meta name="cp:revision" content="1"/>
<meta name="extended-properties:AppVersion" content="16.0000"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="meta:word-count" content="2"/>
<meta name="extended-properties:PresentationFormat" content="Widescreen"/>
<meta name="dc:creator" content="Microsoft Office User"/>
<meta name="extended-properties:Company" content=""/>
<meta name="Word-Count" content="2"/>
<meta name="dcterms:created" content="2021-08-17T10:19:16Z"/>
<meta name="dcterms:modified" content="2021-08-17T10:20:03Z"/>
<meta name="Last-Modified" content="2021-08-17T10:20:03Z"/>
<meta name="Last-Save-Date" content="2021-08-17T10:20:03Z"/>
<meta name="Paragraph-Count" content="1"/>
<meta name="meta:save-date" content="2021-08-17T10:20:03Z"/>
<meta name="dc:title" content="PowerPoint Presentation"/>
<meta name="Application-Name" content="Microsoft Macintosh PowerPoint"/>
<meta name="modified" content="2021-08-17T10:20:03Z"/>
<meta name="Content-Length" content="56524"/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
<meta name="Slide-Count" content="1"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="creator" content="Microsoft Office User"/>
<meta name="meta:author" content="Microsoft Office User"/>
<meta name="meta:creation-date" content="2021-08-17T10:19:16Z"/>
<meta name="extended-properties:Application" content="Microsoft Macintosh PowerPoint"/>
<meta name="meta:last-author" content="Microsoft Office User"/>
<meta name="meta:slide-count" content="1"/>
<meta name="Creation-Date" content="2021-08-17T10:19:16Z"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="TIKA-3526.pptx"/>
<meta name="Last-Author" content="Microsoft Office User"/>
<meta name="Revision-Number" content="1"/>
<meta name="Application-Version" content="16.0000"/>
<meta name="extended-properties:DocSecurityString" content="None"/>
<meta name="Author" content="Microsoft Office User"/>
<meta name="publisher" content=""/>
<meta name="Presentation-Format" content="Widescreen"/>
<meta name="dc:publisher" content=""/>
<title>PowerPoint Presentation</title>
</head>
<body><div class="slide-content"><p/>
<p>Hello pptx</p>
<div class="embedded" id="slide1_rId3"/>
<div class="embedded" id="slide1_rId3"/>
</div>
<div class="slide-master-content"/>
<div class="page"><p/>

<p>hello xlsx</p>

<p/>

</div>

<p>hello xlsx</p>
<div><h1>Sheet1</h1>
<table><tbody><tr>      <td>hello xlsx</td></tr>
</tbody></table>
</div>
<div class="embedded" id="/docProps/thumbnail.jpeg"/></body></html>
{noformat}
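
As a sanity check that does not depend on Tika, the embedded workbook can also be looked for in the container itself, since a pptx is a ZIP package. The following is only a minimal sketch: it assumes the producer stores embedded files under ppt/embeddings/ (the usual location for PowerPoint, but not guaranteed), and the class and method names are illustrative, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not a Tika class.
final class EmbeddedPartLister {

    // Assumption: embedded files usually live under this prefix in a pptx;
    // other producers may place them elsewhere.
    private static final String EMBEDDINGS_PREFIX = "ppt/embeddings/";

    /** Returns the names of the ZIP entries stored under ppt/embeddings/. */
    static List<String> listEmbeddedParts(InputStream pptx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            // Walk the package entry by entry and keep only embedded parts.
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()
                        && entry.getName().startsWith(EMBEDDINGS_PREFIX)) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to the presentation, e.g. TIKA-3526.pptx.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listEmbeddedParts(in)) {
                System.out.println(name);
            }
        }
    }
}
```

If an .xlsx entry is listed for the file but its text is missing from the Tika output, the problem is on the extraction side and not in the document.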

> i cant extract content from attachments in the document
> -------------------------------------------------------
>
>                 Key: TIKA-3526
>                 URL: https://issues.apache.org/jira/browse/TIKA-3526
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: matcha007
>            Priority: Major
>         Attachments: TIKA-3526.pptx
>
>
> Office documents can contain other Office documents as attachments. Can the 
> contents of the attachments be extracted, as shown in the table below?
>  
> | |doc|docx|xls|xlsx|ppt|pptx|
> |txt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pdf|(/)|(/)|(/)|(/)|(x)|(/)|
> |xml|(/)|(/)|(/)|(/)|(x)|(/)|
> |doc|(/)|(/)|(/)|(/)|(x)|(/)|
> |docx|(x)|(/)|(/)|(/)|(x)|(/)|
> |xls|(/)|(/)|(/)|(/)|(x)|(/)|
> |xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
> |ppt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pptx|(/)|(/)|(/)|(/)|(x)|(/)|
>  
>  1. If our usage is wrong, please show us the correct way:
> {code:java}
> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> parser.parse(inputStream, handler, metadata, context);
> {code}
>  
>  2. We use Tika version 1.20. We have also tried the latest version, 2.0, 
> and the problem still exists.
>   
>  3. If this is indeed a gap in the current version, please consider fixing 
> it in a later version.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
