[jira] [Comment Edited] (TIKA-3526) i cant extract content from attachments in the document

Tim Allison (Jira) Fri, 03 Dec 2021 15:29:08 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453247#comment-17453247
 ]


Tim Allison edited comment on TIKA-3526 at 12/3/21, 11:28 PM:
--------------------------------------------------------------

OK.  As something of a hack, but one I've wanted to do for a while, the ppt 
extractor now runs through the slideshow-level attachments at the end of 
processing and handles any that it hasn't already processed as parts of shapes 
within slides.  This may capture some attachment types that we had not 
processed before.  It definitely fixes the issue with these WPS-generated ppt 
files.
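
For anyone following along, here is a rough, stand-alone sketch of that 
two-pass idea.  It is not the actual Tika or POI code: the class, method and 
parameter names are all made up for illustration, and attachments are reduced 
to an id and a file name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; none of these names come from Tika or POI. */
final class DeferredAttachmentSweep {

    /**
     * @param shapeAttachmentIds   ids of attachments referenced by shapes, one list per slide
     * @param slideshowAttachments every attachment stored at the slideshow level (id to name)
     * @return attachment names in the order they would be handed off for extraction
     */
    public static List<String> extractionOrder(List<List<Integer>> shapeAttachmentIds,
                                               Map<Integer, String> slideshowAttachments) {
        List<String> order = new ArrayList<>();
        Set<Integer> handled = new LinkedHashSet<>();

        // Pass 1: attachments reachable through shapes within slides.
        for (List<Integer> slide : shapeAttachmentIds) {
            for (Integer id : slide) {
                String name = slideshowAttachments.get(id);
                // Set.add returns false for ids already seen, so nothing is emitted twice.
                if (name != null && handled.add(id)) {
                    order.add(name);
                }
            }
        }

        // Pass 2: sweep the slideshow-level list and pick up anything not yet handled.
        for (Map.Entry<Integer, String> entry : slideshowAttachments.entrySet()) {
            if (handled.add(entry.getKey())) {
                order.add(entry.getValue());
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<Integer, String> all = new LinkedHashMap<>();
        all.put(1, "a.doc");
        all.put(2, "b.xlsx");
        all.put(3, "c.pdf");
        // Only attachment 1 is referenced from a shape; 2 and 3 are found by the sweep.
        System.out.println(extractionOrder(List.of(List.of(1), List.of()), all));
    }
}
```

Tracking the handled ids in a set is what keeps the final sweep from 
re-processing attachments that were already reached through a shape.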

FTR, I was not able to extract many of the attachments with either MSOffice 
products or LibreOffice.

I can't thank you enough, [~matcha007], for creating and sharing the test 
files and, obviously, for digging into the root causes.

I've just committed and pushed an update.  I'll run regression tests against 
the msoffice files in our large regression corpus.  If there are surprises, we 
can roll back or modify as necessary.

[~matcha007], please give the latest build a try and let us know if there are 
any other critical issues with WPS-generated files.


> i cant extract content from attachments in the document
> -------------------------------------------------------
>
>                 Key: TIKA-3526
>                 URL: https://issues.apache.org/jira/browse/TIKA-3526
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: matcha007
>            Priority: Major
>         Attachments: TIKA-3526.pptx, embedded attachment.doc, embedded 
> attachment.docx, embedded attachment.ppt, embedded attachment.pptx, embedded 
> attachment.xls, embedded attachment.xlsx, image-2021-12-03-11-04-38-478.png, 
> image-2021-12-03-11-05-51-182.png, image-2021-12-03-11-06-44-697.png, 
> image-2021-12-03-11-07-33-659.png, image-2021-12-03-11-11-29-649.png, 
> image-2021-12-03-11-15-51-328.png
>
>
> Office documents can contain other Office documents as attachments. Can the 
> contents of these attachments be extracted? The table below shows our results.
>  
> | |doc|docx|xls|xlsx|ppt|pptx|
> |txt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pdf|(/)|(/)|(/)|(/)|(x)|(/)|
> |xml|(/)|(/)|(/)|(/)|(x)|(/)|
> |doc|(/)|(/)|(/)|(/)|(x)|(/)|
> |docx|(x)|(/)|(/)|(/)|(x)|(/)|
> |xls|(/)|(/)|(/)|(/)|(x)|(/)|
> |xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
> |ppt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pptx|(/)|(/)|(/)|(/)|(x)|(/)|
>  
>  1. If our usage is wrong, please show us the correct way to use it.
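> {code:java}
> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> // handler and inputStream were not declared in the snippet as posted;
> // the two declarations below are assumptions added so that it compiles
> ContentHandler handler = new BodyContentHandler(-1);
> try (InputStream inputStream = new FileInputStream(file)) {
>     parser.parse(inputStream, handler, metadata, context);
> }
> {code}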
>  
>  2. We use Tika version 1.20. We have also tried the latest version, 2.0, 
> and the problem still exists.
>   
>  3. If this is indeed a gap in the current version, please address it in a 
> subsequent version.
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
