Can you open an issue on our JIRA and attach an example file? Thank you!
On Wed, Aug 4, 2021 at 7:31 AM sanliang_fighting <[email protected]> wrote:

> Hello:
>
> Thank you very much for providing such a powerful and practical product.
> While using it, we found some problems in the current version. I hope your
> team can help us solve them.
>
> 1: If we are using it incorrectly, please show us the correct way:
>
> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
>
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> parser.parse(inputStream, handler, metadata, context);
>
> 2: If this is indeed an omission in the current version, please help us by
> addressing it in a subsequent version.
>
> 3: We use Tika version 1.20. We have also tried the latest version, 2.0,
> and the problem still exists.
>
> In general, the problem is that objects of 2013 and above are inserted
> into office series documents in 2007, and the inserted objects are
> automatically ignored when Tika extracts content. As a result, the contents
> of the object cannot be extracted. We compiled a detailed table of
> whether the content can be extracted normally:
>
> Attachment            File
> (inserted object)     doc    docx   xls    xlsx   ppt    pptx
> txt                   Y      Y      Y      Y      N      Y
> pdf                   Y      Y      Y      Y      N      Y
> xml                   Y      Y      Y      Y      N      Y
> doc                   Y      Y      Y      Y      N      Y
> docx                  N      Y      Y      Y      N      Y
> xls                   Y      Y      Y      Y      N      Y
> xlsx                  Y      Y      N      N      N      N
> ppt                   Y      Y      Y      Y      N      Y
> pptx                  Y      Y      Y      Y      N      Y
>
> *We look forward to receiving your reply. Thank you.*
>
> 2021-08-04
> ------------------------------
> sanliang_fighting
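
When preparing an example file for the issue, it may help to first confirm that the inserted object is physically present in the container. The following is only a minimal sketch using the JDK's ZIP classes, not anything from Tika: `EmbeddedPartLister` and `listEmbeddedParts` are hypothetical names, and it assumes the embedded objects sit under an `embeddings` folder (`word/embeddings/`, `xl/embeddings/`, `ppt/embeddings/`), which is where Office usually writes them. It applies only to the ZIP-based docx, xlsx, and pptx containers, not to the older doc, xls, and ppt formats.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: lists the embedded-object parts of an
// OOXML file (docx, xlsx, pptx), which is a ZIP package.
final class EmbeddedPartLister {

    // Assumption: embedded objects are stored under an "embeddings" folder,
    // e.g. word/embeddings/, xl/embeddings/ or ppt/embeddings/.
    static List<String> listEmbeddedParts(InputStream in) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Keep file entries only; skip directory entries.
                if (!entry.isDirectory() && name.contains("/embeddings/")) {
                    parts.add(name);
                }
            }
        }
        return parts;
    }

    // Usage: java EmbeddedPartLister path/to/file.docx
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            for (String part : listEmbeddedParts(in)) {
                System.out.println(part);
            }
        }
    }
}
```

If a part is listed here but its text is missing from the extracted output, the file is likely a good candidate to attach to the issue.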
