[jira] [Commented] (TIKA-3526) i cant extract content from attachments in the document

Tim Allison (Jira) Fri, 03 Dec 2021 08:56:46 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453140#comment-17453140
 ]


Tim Allison commented on TIKA-3526:
-----------------------------------

I think you've solved a number of problems, but not all.  This will take some 
time.  The xlsx file inside the docx file is in oleObject7.bin, which has an 
Ole entry but no CompObj entry (CompObj is what both your code above and the 
original code check for).
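
To illustrate the gap, here is a minimal sketch of why a CompObj-only check 
misses such a file.  It is not Tika's or POI's code: OleEntryCheck, Hint and 
classify are hypothetical names, and the set of root entry names is assumed to 
have been read from the OLE container already.

```java
import java.util.Set;

/** Hypothetical helper, not Tika's code: classifies an OLE container by its root entry names. */
final class OleEntryCheck {

    // Conventional OLE2 stream names; the leading character is U+0001.
    static final String COMP_OBJ = "\u0001CompObj";
    static final String OLE = "\u0001Ole";

    enum Hint { COMP_OBJ_PRESENT, OLE_ONLY, NEITHER }

    /** Reports which of the two marker entries is present among the root entries. */
    static Hint classify(Set<String> rootEntryNames) {
        if (rootEntryNames.contains(COMP_OBJ)) {
            return Hint.COMP_OBJ_PRESENT;
        }
        if (rootEntryNames.contains(OLE)) {
            // The oleObject7.bin case: a check that only looks for CompObj never gets here.
            return Hint.OLE_ONLY;
        }
        return Hint.NEITHER;
    }
}
```

A caller that sees OLE_ONLY would still need some other way to work out what 
the embedded payload is; the sketch only shows where the existing check stops.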

The other issue is that the streaming OPCPackageDetector is short-cutting to 
tika-ooxml instead of waiting to read the {{[Content_Types].xml}}.  This is 
stored as the first physical object by MS Word etc., but appears last in the 
stream for WPS-created files. :(  We'll have to modify this, too.
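
As a rough illustration of the ordering problem, the sketch below walks a zip 
stream in physical order and reports where [Content_Types].xml sits, rather 
than assuming it comes first.  ContentTypesScan and positionOfContentTypes are 
hypothetical names, not part of Tika; the code uses only java.util.zip.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper, not part of Tika: locates the OPC marker entry in a zip stream. */
final class ContentTypesScan {

    static final String CONTENT_TYPES = "[Content_Types].xml";

    /**
     * Returns the zero-based position of [Content_Types].xml among the entries,
     * in the order they appear in the stream, or -1 if there is no such entry.
     * The supplied stream is closed when the scan finishes.
     */
    static int positionOfContentTypes(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            int index = 0;
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (CONTENT_TYPES.equals(entry.getName())) {
                    return index;
                }
                index++;
            }
        }
        return -1;
    }
}
```

For a file that stores the entry last, this returns a nonzero position, so a 
streaming detector cannot make its decision after the first entry alone.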

> i cant extract content from attachments in the document
> -------------------------------------------------------
>
>                 Key: TIKA-3526
>                 URL: https://issues.apache.org/jira/browse/TIKA-3526
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: matcha007
>            Priority: Major
>         Attachments: TIKA-3526.pptx, embedded attachment.doc, embedded 
> attachment.docx, embedded attachment.ppt, embedded attachment.pptx, embedded 
> attachment.xls, embedded attachment.xlsx, image-2021-12-03-11-04-38-478.png, 
> image-2021-12-03-11-05-51-182.png, image-2021-12-03-11-06-44-697.png, 
> image-2021-12-03-11-07-33-659.png, image-2021-12-03-11-11-29-649.png, 
> image-2021-12-03-11-15-51-328.png
>
>
> Office documents can contain other Office documents as attachments. Whether 
> the contents of the attachments can be extracted is shown in the table below.
>  
> | |doc|docx|xls|xlsx|ppt|pptx|
> |txt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pdf|(/)|(/)|(/)|(/)|(x)|(/)|
> |xml|(/)|(/)|(/)|(/)|(x)|(/)|
> |doc|(/)|(/)|(/)|(/)|(x)|(/)|
> |docx|(x)|(/)|(/)|(/)|(x)|(/)|
> |xls|(/)|(/)|(/)|(/)|(x)|(/)|
> |xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
> |ppt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pptx|(/)|(/)|(/)|(/)|(x)|(/)|
>  
>  1. If we are using the API incorrectly, please show us the correct way.
> {code:java}
> // inputStream and handler are created elsewhere (not shown in this report).
> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> parser.parse(inputStream, handler, metadata, context);
> {code}
>  
>  2. We use Tika version 1.20. We have also tried the latest version, 2.0, 
> and the problem still exists.
>   
>  3. If this is indeed a gap in the current version, please help us address 
> it in a subsequent version.
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
