Re: Help: the 2007 version of office series is embedded in the 2013 version of office series, and individual types cannot extract content

Tim Allison Wed, 04 Aug 2021 04:33:12 -0700

Can you open an issue on our JIRA and attach an example file?

Thank you!


On Wed, Aug 4, 2021 at 7:31 AM sanliang_fighting <[email protected]>
wrote:

> Hello:
>
> Thank you very much for providing such a powerful and practical product.
> While using it, we found a problem in the current version, and we hope your
> team can help us solve it.
>
> 1. If we are using Tika incorrectly, please show us the correct way:
>
> // inputStream and handler are not defined in this excerpt
> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
>
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> parser.parse(inputStream, handler, metadata, context);
>
> 2. If this is indeed a gap in the current version, please help us by
> addressing it in a subsequent version.
>
> 3. We use Tika 1.20. We have also tried the latest version, 2.0, and the
> problem still exists.
>
> In short, when objects from Office 2013 or later are inserted into Office
> 2007 documents, Tika silently ignores the inserted objects when it extracts
> content, so the contents of those objects cannot be extracted. The table
> below details which combinations can be extracted normally:
>
> File types run across the top; attachment (inserted object) types run down
> the side. Y = content extracted, N = content not extracted.
>
> Attachment | doc | docx | xls | xlsx | ppt | pptx
> -----------+-----+------+-----+------+-----+-----
> txt        | Y   | Y    | Y   | Y    | N   | Y
> pdf        | Y   | Y    | Y   | Y    | N   | Y
> xml        | Y   | Y    | Y   | Y    | N   | Y
> doc        | Y   | Y    | Y   | Y    | N   | Y
> docx       | N   | Y    | Y   | Y    | N   | Y
> xls        | Y   | Y    | Y   | Y    | N   | Y
> xlsx       | Y   | Y    | N   | N    | N   | N
> ppt        | Y   | Y    | Y   | Y    | N   | Y
> pptx       | Y   | Y    | Y   | Y    | N   | Y
>
> *We look forward to receiving your reply. Thank you.*
>
> 2021-08-04
> ------------------------------
> sanliang_fighting
>
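
When preparing the example file for the JIRA issue, it may help to confirm
first that the inserted object is physically present in the container. The
sketch below is a minimal check that uses only the Java standard library and
is not part of Tika. It assumes the file is an OOXML document (docx, xlsx,
pptx), which is a ZIP package, and that inserted objects sit under an
"embeddings" folder (word/embeddings/, xl/embeddings/, ppt/embeddings/), which
is where Office usually writes them. The class and method names are
hypothetical, and the older OLE2 formats (doc, xls, ppt) are not covered.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists the embedded parts of an OOXML package.
public class EmbeddedPartLister {

    /** Returns the names of ZIP entries whose path contains "/embeddings/". */
    public static List<String> listEmbeddedParts(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: Office stores inserted objects under <app>/embeddings/.
                if (!entry.isDirectory() && entry.getName().contains("/embeddings/")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java EmbeddedPartLister <file.docx|file.xlsx|file.pptx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Print one embedded part name per line.
            listEmbeddedParts(in).forEach(System.out::println);
        }
    }
}
```

If the inserted object shows up in this listing but its text is missing from
the Tika output, the file is a good candidate to attach to the issue.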
