[jira] [Commented] (TIKA-3657) Microsoft documents are not text parsed when running under Docker

Tim Barrett (Jira) Thu, 03 Feb 2022 09:00:05 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486598#comment-17486598
 ]


Tim Barrett commented on TIKA-3657:
-----------------------------------

I have it working :)

I was seeing a problem with another piece of our code that processes email, 
where the Jakarta Activation classes could not be loaded by Tomcat. Our app has 
two major projects: the 'core' app and the web app. The core POM contains the 
Tika dependencies and all the other important stuff we use. The web project is 
a service layer providing REST services that mainly delegate to core functions. 
It would appear that dynamic class loading is handled by the Tomcat webapp 
class loader. Adding jakarta.activation-api as a dependency of the web app 
solved the problem I was seeing with the mail API, and it solved this problem 
too: the config file with the markLimit param now works.
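
For anyone chasing a similar symptom, here is a minimal sketch of how one might 
check whether a class is visible to the class loader that is actually in use 
(under Tomcat, typically the thread context class loader of the web app). The 
helper name isVisible and the default class name jakarta.activation.DataHandler 
are assumptions for illustration only; older activation API versions use the 
javax.activation package instead, so pass whichever name applies.

```java
final class ClassVisibilityCheck {

    /**
     * Returns true if the named class can be found by the given loader.
     * The class is looked up without being initialized.
     */
    public static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not linkable from this loader
            return false;
        }
    }

    public static void main(String[] args) {
        // Inside a web app this is normally the webapp class loader
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // Assumed class name; override it on the command line if needed
        String name = args.length > 0 ? args[0] : "jakarta.activation.DataHandler";
        System.out.println(name + " visible: " + isVisible(name, loader));
    }
}
```

Calling isVisible from code running inside the web app (rather than from a 
standalone main) is what would show whether the web app's own dependencies 
cover the class in question.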

I am still of the opinion that it would be better to have a markLimit value of 
128 * 1024 * 1024 in the POIFS Detector. Otherwise anybody else trying to 
parse large msg files will likely hit the same problem, and the config param 
solution is not obvious (imo).

> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, depedency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: POIFSContainerDetector.java, scenario traces.txt, 
> tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
>     new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class,
>     nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years. 
> This also works under Tika 2.2.0 when running in development environments 
> (Eclipse, Apache Tomcat). However, when running under Docker the text within 
> Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
> Docker, the Microsoft documents are fully parsed, so this problem was 
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via 
> context.set the same problem occurs. Also, if the standard Tika Embedded 
> Document Extractor is used the same problem occurs. Our Docker image contains 
> our application's code which uses Tika, as well as Apache DS. The problem 
> occurs running Docker on Ubuntu, Mac OS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
