Tim Barrett created TIKA-3657:
---------------------------------

             Summary: Microsoft documents are not text parsed when running under Docker
                 Key: TIKA-3657
                 URL: https://issues.apache.org/jira/browse/TIKA-3657
             Project: Tika
          Issue Type: Bug
          Components: config, core, depedency
    Affects Versions: 2.2.1, 2.2.0
            Reporter: Tim Barrett
             Fix For: 2.2.2
         Attachments: tika-config.xml

We use a custom EmbeddedDocumentExtractor, registered with this code:

NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
    new NalyticsEmbeddedDocumentExtractor(this);

this.context.set(EmbeddedDocumentExtractor.class, nalyticsEmbeddedDocumentExtractor);

This all works fine for us and has been used in production for a few years. 
It also works under Tika 2.2.0 when running in development environments 
(Eclipse, Apache Tomcat). However, when running under Docker, the text 
within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
Docker, the Microsoft documents are fully parsed, so this problem was 
introduced in 2.2.0.

Interestingly, I found that if *anything at all* is added to the context via 
context.set, the same problem occurs. Also, if the standard Tika Embedded 
Document Extractor is used, the same problem occurs. Our Docker image contains 
our application's code, which uses Tika, as well as Apache DS. The problem 
occurs running Docker on Ubuntu, macOS and Windows.
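
For readers less familiar with the registration step, here is a rough, stdlib-only sketch of the type-keyed lookup that a call like context.set(SomeType.class, value) relies on: a consumer asks the context for an entry by class and falls back to a default when nothing was set. MiniContext, Extractor and extractorInUse are hypothetical stand-ins invented for illustration, not Tika classes, and the sketch does not reproduce the Docker behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextSketch {

    /** Hypothetical stand-in for an embedded-document extractor. */
    public interface Extractor {
        String name();
    }

    /** Hypothetical type-keyed context: entries are stored under the key class's name. */
    public static final class MiniContext {
        private final Map<String, Object> entries = new HashMap<>();

        /** Registers a value for the given type; a null value clears the entry. */
        public <T> void set(Class<T> key, T value) {
            if (value == null) {
                entries.remove(key.getName());
            } else {
                entries.put(key.getName(), value);
            }
        }

        /** Returns the registered value, or the supplied default if none was set. */
        public <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key.getName());
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Shows which extractor a consumer would pick up from the context. */
    public static String extractorInUse(MiniContext context) {
        Extractor fallback = () -> "default";
        return context.get(Extractor.class, fallback).name();
    }

    public static void main(String[] args) {
        MiniContext context = new MiniContext();
        System.out.println(extractorInUse(context)); // nothing registered yet

        Extractor custom = () -> "custom";
        context.set(Extractor.class, custom);
        System.out.println(extractorInUse(context)); // the registered extractor wins
    }
}
```

In this sketch, registering an entry only changes what is returned for that one key; other keys still resolve to their defaults.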


--
This message was sent by Atlassian Jira
(v8.20.1#820001)
