[
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482534#comment-17482534
]
Tim Allison edited comment on TIKA-3657 at 1/26/22, 2:57 PM:
-------------------------------------------------------------
>I've just found that Docker is not relevant, that was fake news
Good to know.
I'll experiment with randomizing the order of the ZipContainerDetectors to
see whether order still matters. I'm pretty sure we fixed it so that it
doesn't, but I can check again. If you want to add println statements (well,
logging, of course), you can see what's going on. We can add trace logging
there if that would be of any use going forward.
[https://github.com/apache/tika/blob/ae4f26517a1e7c35cc54586e2fd9a41a17d17f74/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java#L86]
[https://github.com/apache/tika/blob/ae4f26517a1e7c35cc54586e2fd9a41a17d17f74/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L197]
So, it could be an order issue, or it could be a class loading issue inside
the actual Tomcat.
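To make the order experiment concrete, here is a minimal sketch of the idea in
plain Java. It does not use Tika's classes: SimpleDetector, detector(),
detectFirstMatch() and resultsAcrossShuffles() are hypothetical stand-ins, the
"first non-null answer wins" rule is an assumption made for illustration, and
the media type strings are only labels. It shuffles the detector list a number
of times, logs each detector's answer, and collects the distinct results; more
than one distinct result would suggest that order matters for that input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;
import java.util.logging.Logger;

final class DetectorOrderCheck {

    private static final Logger LOG =
            Logger.getLogger(DetectorOrderCheck.class.getName());

    /** Hypothetical stand-in for a detector; this is not Tika's interface. */
    interface SimpleDetector {
        String name();

        /** Returns a media type label, or null when the detector has no opinion. */
        String detect(String entryListing);
    }

    /** Builds a detector that answers {@code type} when the listing contains {@code trigger}. */
    static SimpleDetector detector(String name, String trigger, String type) {
        return new SimpleDetector() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public String detect(String entryListing) {
                return entryListing.contains(trigger) ? type : null;
            }
        };
    }

    /** Assumed rule for this sketch: the first detector with a non-null answer wins. */
    static String detectFirstMatch(List<SimpleDetector> detectors, String entryListing) {
        for (SimpleDetector d : detectors) {
            String type = d.detect(entryListing);
            // Logged at FINE, so the logger level has to be raised to see these lines.
            LOG.fine(() -> d.name() + " -> " + type);
            if (type != null) {
                return type;
            }
        }
        return "application/zip"; // fallback when no detector has an opinion
    }

    /** Shuffles the detectors {@code rounds} times and collects the distinct results. */
    static Set<String> resultsAcrossShuffles(List<SimpleDetector> detectors,
                                             String entryListing, int rounds, long seed) {
        Random random = new Random(seed);
        List<SimpleDetector> copy = new ArrayList<>(detectors);
        Set<String> results = new TreeSet<>();
        for (int i = 0; i < rounds; i++) {
            Collections.shuffle(copy, random);
            results.add(detectFirstMatch(copy, entryListing));
        }
        return results;
    }

    public static void main(String[] args) {
        List<SimpleDetector> detectors = List.of(
                detector("ooxml", "[Content_Types].xml", "application/x-tika-ooxml"),
                detector("jar", "META-INF/MANIFEST.MF", "application/java-archive"));

        // Only one detector matches this listing, so order cannot matter.
        String docx = "[Content_Types].xml word/document.xml";
        // Both detectors match this listing, so the result can depend on order.
        String ambiguous = "[Content_Types].xml META-INF/MANIFEST.MF";

        System.out.println(resultsAcrossShuffles(detectors, docx, 20, 42L));
        System.out.println(resultsAcrossShuffles(detectors, ambiguous, 20, 42L));
    }
}
```

Whether the real detectors behave like the first-match rule above is exactly
what the experiment and the trace logging would have to confirm.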
> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
> Key: TIKA-3657
> URL: https://issues.apache.org/jira/browse/TIKA-3657
> Project: Tika
> Issue Type: Bug
> Components: config, core, dependency
> Affects Versions: 2.2.0, 2.2.1
> Reporter: Tim Barrett
> Priority: Major
> Fix For: 2.2.2
>
> Attachments: tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
>     new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class,
>     nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years.
> This also works under Tika 2.2.0 when running in development environments
> (Eclipse, Apache Tomcat). However, when running under Docker, the text
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under
> Docker, the Microsoft documents are fully parsed, so this problem was
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via
> context.set, the same problem occurs. Also, if the standard Tika Embedded
> Document Extractor is used, the same problem occurs. Our Docker image
> contains our application's code, which uses Tika, as well as Apache DS. The
> problem occurs running Docker on Ubuntu, Mac OS and Windows.
>