Help: Office 2007 documents embedded in Office 2013 documents, and content of some types cannot be extracted

sanliang_fighting Wed, 04 Aug 2021 04:31:49 -0700

Hello:

Thank you very much for providing such a powerful and practical product. While 
using it, we found some problems in the current version. We hope your team can 
help us solve them.


1: If we are using the API incorrectly, please show us the correct usage. Our 
current code is:

    File file = new File("XX");
    Parser parser = new OfficeParser();
    ParseContext context = new ParseContext();
    Metadata metadata = new Metadata();
    metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
    metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
    // inputStream and handler are created elsewhere (not shown here)
    parser.parse(inputStream, handler, metadata, context);

2: If this is indeed a gap in the current version, please consider fixing it 
in a future release.

3: We use Tika 1.20. We have also tried the latest version, 2.0, and the 
problem still exists.

In summary, when objects created with Office 2013 or later are inserted into 
Office 2007 documents, Tika silently ignores the inserted objects while 
extracting content, so the content of those objects cannot be extracted. 
The table below shows, for each combination, whether the content is 
extracted correctly:
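Y = content extracted, N = content not extracted

Attachment            File (container document)
(inserted object)     doc   docx  xls   xlsx  ppt   pptx
txt                   Y     Y     Y     Y     N     Y
pdf                   Y     Y     Y     Y     N     Y
xml                   Y     Y     Y     Y     N     Y
doc                   Y     Y     Y     Y     N     Y
docx                  N     Y     Y     Y     N     Y
xls                   Y     Y     Y     Y     N     Y
xlsx                  Y     Y     N     N     N     N
ppt                   Y     Y     Y     Y     N     Y
pptx                  Y     Y     Y     Y     N     Y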
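
For convenience, the same results can be written as data so that the failing 
combinations are easy to list. This is only a minimal sketch in plain Java: 
the class name ExtractionMatrix and the method failures() are illustrative 
and not part of Tika, and it assumes that the rows of the table are the 
inserted attachments and the columns are the container files.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractionMatrix {
    // Container file types (assumed to be the table columns).
    static final String[] CONTAINERS = {"doc", "docx", "xls", "xlsx", "ppt", "pptx"};

    // Inserted attachment types (assumed to be the table rows).
    static final String[] ATTACHMENTS =
            {"txt", "pdf", "xml", "doc", "docx", "xls", "xlsx", "ppt", "pptx"};

    // One string per attachment row; character i is the result for CONTAINERS[i].
    // 'Y' = content extracted, 'N' = content not extracted.
    static final String[] RESULTS = {
        "YYYYNY", // txt
        "YYYYNY", // pdf
        "YYYYNY", // xml
        "YYYYNY", // doc
        "NYYYNY", // docx
        "YYYYNY", // xls
        "YYNNNN", // xlsx
        "YYYYNY", // ppt
        "YYYYNY"  // pptx
    };

    // Returns every "attachment in container" pair marked 'N' in the table.
    static List<String> failures() {
        List<String> out = new ArrayList<>();
        for (int r = 0; r < ATTACHMENTS.length; r++) {
            for (int c = 0; c < CONTAINERS.length; c++) {
                if (RESULTS[r].charAt(c) == 'N') {
                    out.add(ATTACHMENTS[r] + " in " + CONTAINERS[c]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String failure : failures()) {
            System.out.println(failure);
        }
    }
}
```

Read this way, every attachment type fails inside a ppt container, docx 
fails inside doc, and xlsx fails inside xls, xlsx, ppt and pptx.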

We look forward to receiving your reply. Thank you.

2021-08-04


sanliang_fighting
