[jira] [Updated] (TIKA-3526) i cant extract content from attachments in the document

matcha007 (Jira) Tue, 17 Aug 2021 02:08:05 -0700


     [ https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


matcha007 updated TIKA-3526:
----------------------------
    Description: 
Office documents can contain other Office documents as attachments. Can the 
contents of those attachments be extracted? Our results are shown in the 
table below.

 
| |doc|docx|xls|xlsx|ppt|pptx|
|txt|(/)|(/)|(/)|(/)|(x)|(/)|
|pdf|(/)|(/)|(/)|(/)|(x)|(/)|
|xml|(/)|(/)|(/)|(/)|(x)|(/)|
|doc|(/)|(/)|(/)|(/)|(x)|(/)|
|docx|(x)|(/)|(/)|(/)|(x)|(/)|
|xls|(/)|(/)|(/)|(/)|(x)|(/)|
|xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
|ppt|(/)|(/)|(/)|(/)|(x)|(/)|
|pptx|(/)|(/)|(/)|(/)|(x)|(/)|
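
To make the table easier to scan, here is a minimal sketch that encodes it and lists every combination marked (x). The class and method names (AttachmentMatrix, failures, failuresInColumn) are hypothetical and used only for illustration. The sketch treats the labels purely as row and column and makes no assumption about which axis is the container and which is the attachment.

```java
import java.util.ArrayList;
import java.util.List;

public class AttachmentMatrix {
    // Column and row labels copied from the table in the report.
    static final String[] COLUMNS = {"doc", "docx", "xls", "xlsx", "ppt", "pptx"};
    static final String[] ROWS = {"txt", "pdf", "xml", "doc", "docx", "xls", "xlsx", "ppt", "pptx"};

    // One string per row; '/' stands for (/) and 'x' stands for (x).
    static final String[] CELLS = {
        "////x/", // txt
        "////x/", // pdf
        "////x/", // xml
        "////x/", // doc
        "x///x/", // docx
        "////x/", // xls
        "//xxxx", // xlsx
        "////x/", // ppt
        "////x/", // pptx
    };

    // Collect every "row/column" pair whose cell is marked (x).
    static List<String> failures() {
        List<String> result = new ArrayList<>();
        for (int r = 0; r < ROWS.length; r++) {
            for (int c = 0; c < COLUMNS.length; c++) {
                if (CELLS[r].charAt(c) == 'x') {
                    result.add(ROWS[r] + "/" + COLUMNS[c]);
                }
            }
        }
        return result;
    }

    // Count the (x) cells in a single column; returns 0 for an unknown label.
    static int failuresInColumn(String column) {
        int count = 0;
        for (int c = 0; c < COLUMNS.length; c++) {
            if (!COLUMNS[c].equals(column)) {
                continue;
            }
            for (String row : CELLS) {
                if (row.charAt(c) == 'x') {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int total = ROWS.length * COLUMNS.length;
        System.out.println(failures().size() + " of " + total + " combinations are marked (x)");
        System.out.println(failures());
    }
}
```

Read this way, every cell in the ppt column is marked (x), and the xlsx row accounts for most of the remaining failures.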

 
 1. If our usage is wrong, please show us the correct way.
{code:java}
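File file = new File("XX");
Parser parser = new OfficeParser();
ParseContext context = new ParseContext();
Metadata metadata = new Metadata();
// Assumption: the report does not show how the handler was created.
// BodyContentHandler(-1) collects body text with no write limit.
ContentHandler handler = new BodyContentHandler(-1);

metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
// Assumption: the stream is opened from the same file and closed after parsing.
try (InputStream inputStream = TikaInputStream.get(file.toPath())) {
    parser.parse(inputStream, handler, metadata, context);
}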
{code}
 
 2. We use Tika 1.20. We have also tried the latest version, 2.0, and the 
problem still exists.
  
 3. If this is indeed a gap in the current version, please address it in a 
subsequent release.
  

  was:
Office documents can contain other Office documents as attachments. Can the 
contents of those attachments be extracted? Our results are shown in the 
table below.

 
|| ||doc||docx||xls||xlsx||ppt||pptx||
|txt|(/)|(/)|(/)|(/)|(x)|(/)| |
|pdf|(/)|(/)|(/)|(/)|(x)|(/)| |
|xml|(/)|(/)|(/)|(/)|(x)|(/)| |
|doc|(/)|(/)|(/)|(/)|(x)|(/)| |
|docx|(x)|(/)|(/)|(/)|(x)|(/)| |
|xls|(/)|(/)|(/)|(/)|(x)|(/)| |
|xlsx|(/)|(/)|(x)|(x)|(x)|(x)| |
|ppt|(/)|(/)|(/)|(/)|(x)|(/)| |
|pptx|(/)|(/)|(/)|(/)|(x)|(/)| |

 
 1. If our usage is wrong, please show us the correct way.
{code:java}
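File file = new File("XX");
Parser parser = new OfficeParser();
ParseContext context = new ParseContext();
Metadata metadata = new Metadata();
// Assumption: the report does not show how the handler was created.
// BodyContentHandler(-1) collects body text with no write limit.
ContentHandler handler = new BodyContentHandler(-1);

metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
// Assumption: the stream is opened from the same file and closed after parsing.
try (InputStream inputStream = TikaInputStream.get(file.toPath())) {
    parser.parse(inputStream, handler, metadata, context);
}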
{code}
 
 2. We use Tika 1.20. We have also tried the latest version, 2.0, and the 
problem still exists.
  
 3. If this is indeed a gap in the current version, please address it in a 
subsequent release.
  


> i cant extract content from attachments in the document
> -------------------------------------------------------
>
>                 Key: TIKA-3526
>                 URL: https://issues.apache.org/jira/browse/TIKA-3526
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: matcha007
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
