[jira] [Comment Edited] (TIKA-3526) i cant extract content from attachments in the document

matcha007 (Jira) Thu, 02 Dec 2021 21:51:08 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452731#comment-17452731
 ]


matcha007 edited comment on TIKA-3526 at 12/3/21, 5:50 AM:
-----------------------------------------------------------

Finally, the PPT embedded files.
I found that the reason I couldn't get the embedded object is that the ExEmbed 
record could not be found when execution reached 
org.apache.poi.hslf.usermodel.HSLFObjectShape.getObjectData.
So I modified the code.
{code:java}
package org.apache.poi.hslf.usermodel;

public final class HSLFObjectShape extends HSLFPictureShape
        implements ObjectShape<HSLFShape,HSLFTextParagraph> {
    ......
    private ExEmbed getExEmbed(boolean create) {
        if (_exEmbed == null) {
            ......
            // Changed: select the ExEmbed whose storage reference
            // equals getObjectID() + 3.
            int id = getObjectID() + 3;
            for (Record ch : lst.getChildRecords()) {
                if (ch instanceof ExEmbed) {
                    ExEmbed embd = (ExEmbed) ch;
                    if (embd.getExOleObjAtom().getObjStgDataRef() == id) {
                        _exEmbed = embd;
                    }
                }
            }
            ......
        }
        return _exEmbed;
    }
    ......
}
{code}
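The lookup change above can be sketched in isolation. This is only a minimal, self-contained illustration, not POI code: EmbedRecord and EmbedLookup are hypothetical stand-ins, storageRef plays the role of the value returned by getObjStgDataRef(), and the offset is passed in as a parameter rather than hard-coded as + 3. Unlike the loop above, which keeps the last match, the sketch returns the first match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for an embedded-object record; not a POI class. */
final class EmbedRecord {
    final int storageRef; // plays the role of getObjStgDataRef()
    final String label;   // illustrative name for the embedded file

    EmbedRecord(int storageRef, String label) {
        this.storageRef = storageRef;
        this.label = label;
    }
}

/** Hypothetical helper contrasting two ways of selecting a record. */
final class EmbedLookup {

    /** Selects the record at the given position in the list, if any. */
    static Optional<EmbedRecord> byPosition(List<EmbedRecord> records, int index) {
        if (index < 0 || index >= records.size()) {
            return Optional.empty();
        }
        return Optional.of(records.get(index));
    }

    /** Selects the first record whose storage reference equals objectId + offset. */
    static Optional<EmbedRecord> byStorageRef(List<EmbedRecord> records,
                                              int objectId, int offset) {
        int wanted = objectId + offset;
        for (EmbedRecord r : records) {
            if (r.storageRef == wanted) {
                return Optional.of(r);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<EmbedRecord> records = Arrays.asList(
                new EmbedRecord(4, "a.doc"),
                new EmbedRecord(7, "b.xls"));
        // The two strategies can disagree for the same object id.
        System.out.println(byStorageRef(records, 1, 3).map(r -> r.label).orElse("none"));
        System.out.println(byPosition(records, 1).map(r -> r.label).orElse("none"));
    }
}
```

With the sample list, the reference-based lookup for object id 1 and offset 3 selects a.doc, while the positional lookup for index 1 selects b.xls, which shows why the two approaches can return different records.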
Then I could get the embedded ppt, pptx, doc, docx, xls and xlsx files from the PPT.
But I still couldn't get the embedded txt, pdf and xml files from the PPT.
So I modified the code again.
{code:java}
package org.apache.tika.parser.microsoft;

public class HSLFExtractor extends AbstractPOIFSExtractor {
    ......
    private void handleSlideEmbeddedResources(ShapeContainer shapeContainer,
            XHTMLContentHandler xhtml) throws TikaException, SAXException, IOException {
        ......
        if (!mediaType.equals("application/x-tika-msoffice-embedded; format=comp_obj")
                && !mediaType.equals("application/x-tika-msoffice")
                && !mediaType.equals("application/x-tika-msoffice-embedded; format=ole10_native")) {
            this.handleEmbeddedResource(stream, objID, objID, mediaType, xhtml, false);
        }
        ......
    }
    ......
}
{code}
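The media type check above can also be written as a set lookup. The following is only a sketch under stated assumptions: EmbeddedMediaTypeFilter and isGenericResource are hypothetical names that do not exist in Tika, and the three strings are copied from the condition above. One deliberate difference is that a null media type returns false here, whereas mediaType.equals(...) in the code above would throw a NullPointerException.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper; not part of Tika. */
final class EmbeddedMediaTypeFilter {

    // Media types excluded by the condition shown above.
    private static final Set<String> EXCLUDED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                    "application/x-tika-msoffice-embedded; format=comp_obj",
                    "application/x-tika-msoffice",
                    "application/x-tika-msoffice-embedded; format=ole10_native")));

    /** Returns true when the media type is non-null and not one of the excluded types. */
    static boolean isGenericResource(String mediaType) {
        return mediaType != null && !EXCLUDED.contains(mediaType);
    }

    public static void main(String[] args) {
        System.out.println(isGenericResource("application/pdf"));
        System.out.println(isGenericResource("application/x-tika-msoffice"));
    }
}
```

Keeping the excluded types in one set makes it easier to see which types are skipped and to add another one later without lengthening the condition.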
OMG, I succeeded.
 


was (Author: matcha007):
Finally, the PPT embedded files.
I found that the reason I couldn't get the embedded object is that the ExEmbed 
record could not be found when execution reached 
org.apache.poi.hslf.usermodel.HSLFObjectShape.getObjectData.
So I modified the code.
{code:java}
package org.apache.poi.hslf.usermodel;

public final class HSLFObjectShape extends HSLFPictureShape
        implements ObjectShape<HSLFShape,HSLFTextParagraph> {
    ......
    private ExEmbed getExEmbed(boolean create) {
        if (_exEmbed == null) {
            ......
            // Changed: select the ExEmbed at position getObjectID(),
            // counting only the ExEmbed child records.
            int id = getObjectID();
            int i = 0;
            for (Record ch : lst.getChildRecords()) {
                if (ch instanceof ExEmbed) {
                    ExEmbed embd = (ExEmbed) ch;
                    if (i++ == id) {
                        _exEmbed = embd;
                    }
                }
            }
            ......
        }
        return _exEmbed;
    }
    ......
}
{code}
Then I could get the embedded ppt, pptx, doc, docx, xls and xlsx files from the PPT.
But I still couldn't get the embedded txt, pdf and xml files from the PPT.
So I modified the code again.
{code:java}
package org.apache.tika.parser.microsoft;

public class HSLFExtractor extends AbstractPOIFSExtractor {
    ......
    private void handleSlideEmbeddedResources(ShapeContainer shapeContainer,
            XHTMLContentHandler xhtml) throws TikaException, SAXException, IOException {
        ......
        if (!mediaType.equals("application/x-tika-msoffice-embedded; format=comp_obj")
                && !mediaType.equals("application/x-tika-msoffice")
                && !mediaType.equals("application/x-tika-msoffice-embedded; format=ole10_native")) {
            this.handleEmbeddedResource(stream, objID, objID, mediaType, xhtml, false);
        }
        ......
    }
    ......
}
{code}
OMG, I succeeded.
 

> i cant extract content from attachments in the document
> -------------------------------------------------------
>
>                 Key: TIKA-3526
>                 URL: https://issues.apache.org/jira/browse/TIKA-3526
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: matcha007
>            Priority: Major
>         Attachments: TIKA-3526.pptx, embedded attachment.doc, embedded 
> attachment.docx, embedded attachment.ppt, embedded attachment.pptx, embedded 
> attachment.xls, embedded attachment.xlsx, image-2021-12-03-11-04-38-478.png, 
> image-2021-12-03-11-05-51-182.png, image-2021-12-03-11-06-44-697.png, 
> image-2021-12-03-11-07-33-659.png, image-2021-12-03-11-11-29-649.png, 
> image-2021-12-03-11-15-51-328.png
>
>
> Office documents can contain other documents as attachments. Whether the 
> contents of the attachments can be extracted is shown in the table below 
> (rows: attachment type, columns: containing document type).
>  
> | |doc|docx|xls|xlsx|ppt|pptx|
> |txt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pdf|(/)|(/)|(/)|(/)|(x)|(/)|
> |xml|(/)|(/)|(/)|(/)|(x)|(/)|
> |doc|(/)|(/)|(/)|(/)|(x)|(/)|
> |docx|(x)|(/)|(/)|(/)|(x)|(/)|
> |xls|(/)|(/)|(/)|(/)|(x)|(/)|
> |xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
> |ppt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pptx|(/)|(/)|(/)|(/)|(x)|(/)|
>  
>  1. If we are using Tika incorrectly, please show us the correct way.
> {code:java}
> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> parser.parse(inputStream, handler, metadata, context);
> {code}
>  
>  2. We use Tika version 1.20. We have also tried replacing it with the latest 
> version, 2.0, and the problem still exists.
>   
>  3. If this is indeed a gap in the current version, please help us address 
> it in a subsequent version.
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
