[
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486598#comment-17486598
]
Tim Barrett commented on TIKA-3657:
-----------------------------------
I have it working :)
I was seeing a problem with another email-processing piece of our code, where
Jakarta Activation classes could not be loaded by Tomcat. Our app has two major
projects: the 'core' app and the web app. The core POM contains the Tika
dependencies and all the other important libraries we use. The web project is a
service layer providing REST services that mainly delegate to core functions.
It appears that dynamic class loading is handled by the Tomcat web app class
loader, so adding jakarta.activation-api as a dependency in the web app solved
the problem I was seeing with the mail API. It solved more than that, though:
it solved this problem too, and the config file with the markLimit param now
works.
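For anyone debugging a similar setup, a minimal sketch of how to check whether a class is visible to the class loader that is active inside the web app. The class name ClassVisibilityCheck and the default class name jakarta.activation.DataHandler are assumptions for illustration only; substitute whichever class fails to load in your logs.

```java
final class ClassVisibilityCheck {

    /**
     * Returns true if the given loader can load className.
     * The class is looked up but not initialised.
     */
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Inside Tomcat, the thread context class loader is normally the
        // web app's own loader, so run this check from within the web app.
        ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
        // Assumed class name, for illustration; pass another as an argument.
        String name = args.length > 0 ? args[0] : "jakarta.activation.DataHandler";
        System.out.println(name + " visible: " + isVisible(name, contextLoader));
    }
}
```

Calling isVisible from a servlet or REST handler, rather than from a standalone main, shows what the deployed web app can actually see.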
I am still of the opinion that it would be better to have a markLimit value of
128 * 1024 * 1024 in the POIFS Detector. Otherwise anybody else trying to
parse large msg files will likely hit the same problem, and the config param
solution is not obvious (imo).
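To put a number on that value, a small sketch of the arithmetic. The class MarkLimitArithmetic and the helper fitsWithin are hypothetical and only illustrate the size comparison; they are not part of Tika, and how the detector actually applies the limit is not described in this message.

```java
final class MarkLimitArithmetic {

    // The value proposed above: 128 MiB expressed in bytes.
    static final int PROPOSED_MARK_LIMIT = 128 * 1024 * 1024;

    /** Hypothetical helper: does a file of this size fit within the limit? */
    static boolean fitsWithin(long fileSizeBytes, int markLimit) {
        return fileSizeBytes <= markLimit;
    }

    public static void main(String[] args) {
        // 128 * 1024 * 1024 = 134217728, which is below Integer.MAX_VALUE,
        // so the value can be held in an int without overflow.
        System.out.println("markLimit bytes: " + PROPOSED_MARK_LIMIT);
        // Example: a 200 MB file would not fit within the proposed limit.
        System.out.println("200 MB fits: " + fitsWithin(200_000_000L, PROPOSED_MARK_LIMIT));
    }
}
```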
> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
> Key: TIKA-3657
> URL: https://issues.apache.org/jira/browse/TIKA-3657
> Project: Tika
> Issue Type: Bug
> Components: config, core, dependency
> Affects Versions: 2.2.0, 2.2.1
> Reporter: Tim Barrett
> Priority: Major
> Fix For: 2.2.2
>
> Attachments: POIFSContainerDetector.java, scenario traces.txt,
> tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
>     new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class,
>     nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years.
> This also works under Tika 2.2.0 when running in development environments
> (Eclipse, Apache Tomcat). However, when running under Docker the text
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under
> Docker, the Microsoft documents are fully parsed, so this problem was
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via
> context.set the same problem occurs. Also, if the standard Tika Embedded
> Document Extractor is used the same problem occurs. Our Docker image contains
> our application's code which uses Tika, as well as Apache DS. The problem
> occurs running Docker on Ubuntu, Mac OS and Windows.
>