Tim Barrett created TIKA-3657:
---------------------------------
Summary: Microsoft documents are not text parsed when running
under Docker
Key: TIKA-3657
URL: https://issues.apache.org/jira/browse/TIKA-3657
Project: Tika
Issue Type: Bug
Components: config, core, dependency
Affects Versions: 2.2.1, 2.2.0
Reporter: Tim Barrett
Fix For: 2.2.2
Attachments: tika-config.xml
We use EmbeddedDocumentExtractor, with this code:
NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
    new NalyticsEmbeddedDocumentExtractor(this);
this.context.set(EmbeddedDocumentExtractor.class,
    nalyticsEmbeddedDocumentExtractor);
This all works fine for us, and has been used in production for a few years.
This also works under Tika 2.2.0 when running in development environments
(Eclipse, Apache Tomcat). However, when running under Docker, the text
within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under
Docker, the Microsoft documents are fully parsed, so this problem was
introduced in 2.2.0.
Interestingly, I found that if *anything at all* is added to the context via
context.set, the same problem occurs. The same problem also occurs if the
standard Tika Embedded Document Extractor is used instead of our own. Our
Docker image contains our application's code, which uses Tika, as well as
Apache DS. The problem occurs when running Docker on Ubuntu, Mac OS and Windows.
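
For anyone trying to reproduce this, the following is a minimal sketch of the kind of registration described above, using the standard extractor rather than our own. It assumes tika-core and the standard parser modules are on the classpath. The class name ContextSetRepro, the helper method names, and the file path argument are hypothetical and not taken from our application.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical reproduction class; not part of Tika or of our code base.
public class ContextSetRepro {

    // Builds a ParseContext with the standard embedded document extractor
    // registered through context.set, mirroring the setup in this report.
    static ParseContext buildContext(Parser parser) {
        ParseContext context = new ParseContext();
        // The standard extractor delegates to the Parser found in the context.
        context.set(Parser.class, parser);
        context.set(EmbeddedDocumentExtractor.class,
                new ParsingEmbeddedDocumentExtractor(context));
        return context;
    }

    // Parses the stream and returns the extracted body text.
    static String extractText(InputStream stream) throws Exception {
        Parser parser = new AutoDetectParser();
        // -1 disables the default write limit on the handler.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(stream, handler, new Metadata(), buildContext(parser));
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a test document, e.g. a Word file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in));
        }
    }
}
```

Running this against the same Word document inside and outside the Docker image, and under 2.1.0 versus 2.2.0, should show whether the extracted text differs between environments.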
--
This message was sent by Atlassian Jira
(v8.20.1#820001)