[
https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453247#comment-17453247
]
Tim Allison edited comment on TIKA-3526 at 12/3/21, 11:28 PM:
--------------------------------------------------------------
OK. As something of a hack, but something I had wanted to do for a while, the
ppt extractor now runs through the slideshow-level attachments at the end of
processing and handles any that it hasn't already processed as parts of shapes
within slides. This may capture some attachment types that we had not processed
before. It definitely fixes the issue with these WPS-generated ppt files.
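For illustration only, the two-pass idea can be sketched in plain Java. This is
not the actual Tika code: the class name, the extractAll method, and the use of
integer ids to stand in for attachments are all hypothetical, and real
attachments would be handed to an embedded-document handler rather than
collected by name.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: each slide is reduced to the list of attachment ids
// that its shapes reference; the slideshow holds the full id -> name map.
class TwoPassAttachmentSketch {

    // Returns attachment names in the order they would be handled:
    // first those reached through slide shapes, then any leftovers
    // that exist only at the slideshow level.
    static List<String> extractAll(List<List<Integer>> slideShapeIds,
                                   Map<Integer, String> slideshowAttachments) {
        List<String> handled = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();

        // Pass 1: attachments referenced by shapes within slides.
        for (List<Integer> shapeIds : slideShapeIds) {
            for (Integer id : shapeIds) {
                String name = slideshowAttachments.get(id);
                if (name != null && seen.add(id)) {
                    handled.add(name);
                }
            }
        }

        // Pass 2: slideshow-level attachments not handled in pass 1.
        for (Map.Entry<Integer, String> entry : slideshowAttachments.entrySet()) {
            if (seen.add(entry.getKey())) {
                handled.add(entry.getValue());
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        Map<Integer, String> attachments = new LinkedHashMap<>();
        attachments.put(1, "a.doc");
        attachments.put(2, "b.xlsx");
        attachments.put(3, "c.pdf");

        // Only attachment 2 is referenced from a shape; 1 and 3 are orphans.
        List<List<Integer>> slides = List.of(List.of(2), List.<Integer>of());

        System.out.println(extractAll(slides, attachments));
        // prints [b.xlsx, a.doc, c.pdf]
    }
}
```

The set of seen ids is what keeps an attachment from being emitted twice when
it is reachable both from a shape and from the slideshow-level list.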
FTR, I was not able to extract many of the attachments with either MSOffice
products or LibreOffice.
I can't thank you enough, [~matcha007], for creating the test files and sharing
them and, obv, for digging into the root causes.
I have just committed and pushed an update. I'll run regression tests against
the msoffice files in our large regression corpus. If there are surprises, we
can roll back or modify as necessary.
[~matcha007], please give the latest build a try and let us know if there are
any other critical issues with WPS-generated files.
> i cant extract content from attachments in the document
> -------------------------------------------------------
>
> Key: TIKA-3526
> URL: https://issues.apache.org/jira/browse/TIKA-3526
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.20
> Reporter: matcha007
> Priority: Major
> Attachments: TIKA-3526.pptx, embedded attachment.doc, embedded
> attachment.docx, embedded attachment.ppt, embedded attachment.pptx, embedded
> attachment.xls, embedded attachment.xlsx, image-2021-12-03-11-04-38-478.png,
> image-2021-12-03-11-05-51-182.png, image-2021-12-03-11-06-44-697.png,
> image-2021-12-03-11-07-33-659.png, image-2021-12-03-11-11-29-649.png,
> image-2021-12-03-11-15-51-328.png
>
>
> Office-series documents can contain other Office-series documents as
> attachments. Can the contents of these attachments be extracted? Our results
> are shown in the table below:
>
> | |doc|docx|xls|xlsx|ppt|pptx|
> |txt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pdf|(/)|(/)|(/)|(/)|(x)|(/)|
> |xml|(/)|(/)|(/)|(/)|(x)|(/)|
> |doc|(/)|(/)|(/)|(/)|(x)|(/)|
> |docx|(x)|(/)|(/)|(/)|(x)|(/)|
> |xls|(/)|(/)|(/)|(/)|(x)|(/)|
> |xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
> |ppt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pptx|(/)|(/)|(/)|(/)|(x)|(/)|
>
> 1. If we are using Tika incorrectly, please show us the correct way:
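> {code:java}
> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> // Collect the extracted text; -1 disables the handler's write limit
> ContentHandler handler = new BodyContentHandler(-1);
> try (InputStream inputStream = new FileInputStream(file)) {
>     parser.parse(inputStream, handler, metadata, context);
> }
> {code}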
>
> 2. We use Tika version 1.20. We have also tried the latest version, 2.0, and
> the problem still exists.
>
> 3. If this is indeed a gap in the current version, please help us fix it in
> a subsequent version.
>