[jira] [Commented] (TIKA-3657) Microsoft documents are not text parsed when running under Docker

Tim Barrett (Jira) Tue, 01 Feb 2022 06:10:04 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485268#comment-17485268
 ]


Tim Barrett commented on TIKA-3657:
-----------------------------------

I've ruled out the difference between your Tomcat image and ours as the 
cause. I switched your Dockerfile to use the same Tomcat image we use, and your 
servlet worked fine.

I also rearranged your servlet to follow our order of setting parsers etc., 
and that didn't reproduce our problem either.

I'm pretty sure the problem is in the initial loading of the config. Checking 
the detectors in the TikaConfig after I load it, I only see:

configFilePath: /usr/local/tomcat/conf/nalanda/tika-config.xml

config exists: true

 storedDetector in  config org.apache.tika.detect.OverrideDetector@4c4be666

 storedDetector in  config org.gagravarr.tika.OggDetector@73fe1f76

 storedDetector in  config org.apache.tika.mime.MimeTypes@1fa249a1

I see that the config loading gets down deep into XML DOM loading. Any clues 
here?
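
To compare what the file declares with what ends up in the loaded TikaConfig, 
a minimal sketch like the one below could be run inside the container. It uses 
only the JDK's DOM parser, not Tika's own loader, so it only shows which 
<detector> entries the XML contains and which DOM implementation the JVM picked 
up. The class name DeclaredDetectors and the inline sample config are 
hypothetical; the sample is not the attached tika-config.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DeclaredDetectors {

    /** Returns the "class" attribute of every <detector> element in the XML. */
    public static List<String> declaredDetectors(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xml);

        // Matches <detector> elements at any depth in the document.
        NodeList nodes = doc.getElementsByTagName("detector");
        List<String> classes = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element detector = (Element) nodes.item(i);
            classes.add(detector.getAttribute("class"));
        }
        return classes;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample config, used only when no path is passed in.
        String sample =
            "<properties>"
          + "<detectors>"
          + "<detector class=\"org.apache.tika.detect.DefaultDetector\"/>"
          + "</detectors>"
          + "</properties>";

        // The DOM implementation may differ between environments.
        System.out.println("DOM factory: "
            + DocumentBuilderFactory.newInstance().getClass().getName());

        try (InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))) {
            for (String cls : declaredDetectors(in)) {
                System.out.println("declared detector: " + cls);
            }
        }
    }
}
```

Passing /usr/local/tomcat/conf/nalanda/tika-config.xml as the first argument 
would list the detectors declared in our file, which can then be compared with 
the storedDetector lines above.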

> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, depedency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: scenario traces.txt, tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = new 
> NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class, 
> nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years. 
> This also works under Tika 2.2.0 when running in development environments 
> (Eclipse, Apache Tomcat). However, when running under Docker, the text 
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
> Docker, the Microsoft documents are fully parsed, so this problem was 
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via 
> context.set the same problem occurs. Also, if the standard Tika Embedded 
> Document Extractor is used the same problem occurs. Our Docker image contains 
> our application's code which uses Tika, as well as Apache DS. The problem 
> occurs running Docker on Ubuntu, Mac OS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
