Hello:
Thank you very much for providing such a powerful and practical product. While
using the current version we found a problem, and we hope your team can help
us solve it.
1: If we are using Tika incorrectly, please show us the correct usage. This is
how we call it:
    File file = new File("XX");
    Parser parser = new OfficeParser();
    ParseContext context = new ParseContext();
    Metadata metadata = new Metadata();
    metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
    metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
    parser.parse(inputStream, handler, metadata, context);
2: If this is indeed a gap in the current version, please fix it in a
subsequent release.
3: We use Tika version 1.20. We also tried the latest version, 2.0, and the
problem still exists.
In short, the problem is this: when objects from Office 2013 or later are
inserted into Office 2007 documents, Tika silently ignores the inserted objects
during content extraction, so their contents cannot be extracted. The table
below details which combinations extract normally:
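Attachment           File (containing document)
(inserted object)    doc    docx   xls    xlsx   ppt    pptx
txt                  Y      Y      Y      Y      N      Y
pdf                  Y      Y      Y      Y      N      Y
xml                  Y      Y      Y      Y      N      Y
doc                  Y      Y      Y      Y      N      Y
docx                 N      Y      Y      Y      N      Y
xls                  Y      Y      Y      Y      N      Y
xlsx                 Y      Y      N      N      N      N
ppt                  Y      Y      Y      Y      N      Y
pptx                 Y      Y      Y      Y      N      Y
(Y = content extracted normally, N = content not extracted)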
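To make the failing cases easier to see at a glance, here is a small sketch that encodes the table above and lists every combination marked N. It does not call Tika; it only restates our results. The class name `ExtractionMatrix` and its members are made up for illustration, and it assumes the rows are the inserted objects and the columns are the containing documents.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractionMatrix {
    // Containing document types (assumed to be the columns of the table).
    static final String[] CONTAINERS = {"doc", "docx", "xls", "xlsx", "ppt", "pptx"};

    // Inserted object types (assumed to be the rows of the table).
    static final String[] ATTACHMENTS =
        {"txt", "pdf", "xml", "doc", "docx", "xls", "xlsx", "ppt", "pptx"};

    // One string per attachment row; character i is the result for CONTAINERS[i].
    static final String[] RESULTS = {
        "YYYYNY", // txt
        "YYYYNY", // pdf
        "YYYYNY", // xml
        "YYYYNY", // doc
        "NYYYNY", // docx
        "YYYYNY", // xls
        "YYNNNN", // xlsx
        "YYYYNY", // ppt
        "YYYYNY"  // pptx
    };

    // Returns every "attachment in container" combination marked N.
    static List<String> failingCombinations() {
        List<String> failures = new ArrayList<>();
        for (int row = 0; row < ATTACHMENTS.length; row++) {
            for (int col = 0; col < CONTAINERS.length; col++) {
                if (RESULTS[row].charAt(col) == 'N') {
                    failures.add(ATTACHMENTS[row] + " in " + CONTAINERS[col]);
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        failingCombinations().forEach(System.out::println);
    }
}
```

Read this way, nothing inserted into a ppt file is extracted, and an inserted xlsx is extracted only from doc and docx files.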
We look forward to receiving your reply. Thank you.
2021-08-04
sanliang_fighting