[jira] [Comment Edited] (TIKA-3657) Microsoft documents are not text parsed when running under Docker

Tim Allison (Jira) Tue, 25 Jan 2022 04:12:23 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481770#comment-17481770
 ]


Tim Allison edited comment on TIKA-3657 at 1/25/22, 12:11 PM:
--------------------------------------------------------------

I've updated the repo with a copy/paste of the above, and I'm not able to 
reproduce problems. This does not mean I don't trust you are having problems!

 

Does your pom exclude transitive dependencies?  You shouldn't need to pull in 
pdfbox, poi, etc. if you're including {{tika-parsers-standard-package}}.
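
One way to check whether the Microsoft and PDF parser dependencies actually made it into the image is a small classpath probe run inside the container. This is only a minimal sketch and not part of Tika: the {{ClasspathProbe}} class is hypothetical, and the class names it looks for are assumed to be representative of tika-parsers-standard-package's transitive dependencies (Tika's Office parser, POI, PDFBox).

```java
import java.util.List;

// Hypothetical helper, not part of Tika: reports whether a few classes that the
// Office/PDF parsers rely on can be loaded from the current classpath.
class ClasspathProbe {

    // Assumption: these names are representative of what
    // tika-parsers-standard-package normally brings in transitively.
    static final List<String> CLASSES = List.of(
            "org.apache.tika.parser.microsoft.OfficeParser",
            "org.apache.poi.hwpf.HWPFDocument",
            "org.apache.poi.xwpf.usermodel.XWPFDocument",
            "org.apache.pdfbox.pdmodel.PDDocument");

    // Returns true if the class can be located, without running its static initializers.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : CLASSES) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it with the same classpath as the application inside the container; any line marked MISSING would suggest that a dependency was excluded or dropped during packaging.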

 
{noformat}
docker run \
  -v /home/tallison/Intellij/tika-1.x/tika-parsers/src/test/resources/test-documents/:/data \
  -v `pwd`:/output \
  tika-app-custom-docker \
  java -jar /tika-bin/tika-app-custom-docker.jar /tika-bin/my-tika-config.xml /data/testWORD.doc testWORD.xhtml
{noformat}
The above yields actual content.  I'm building with Java 11, and you can see 
Docker is using Java 11.  Is there any way you can modify my repo to get it to 
fail?


> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, depedency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class, nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years. 
> This also works under Tika 2.2.0 when running in development environments 
> (Eclipse, Apache Tomcat). However, when running under Docker the text 
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
> Docker, the Microsoft documents are fully parsed, so this problem was 
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via 
> context.set the same problem occurs. Also, if the standard Tika Embedded 
> Document Extractor is used the same problem occurs. Our Docker image contains 
> our application's code which uses Tika, as well as Apache DS. The problem 
> occurs running Docker on Ubuntu, Mac OS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
