[jira] [Comment Edited] (TIKA-3657) Microsoft documents are not text parsed when running under Docker

Tim Allison (Jira) Fri, 21 Jan 2022 03:17:04 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480007#comment-17480007
 ]


Tim Allison edited comment on TIKA-3657 at 1/21/22, 11:16 AM:
--------------------------------------------------------------

To confirm: if you set anything in a ParseContext, is content from MS Office 
documents (.doc, .docx, etc.) not extracted, whether or not those documents are 
embedded in another file?  Or are you saying that files embedded within 
.doc/.docx/etc. files are not extracted?

 

If you don't use your custom tika-config.xml, do you see the same behavior?

 

>this.context.set(EmbeddedDocumentExtractor.class, 
>nalyticsEmbeddedDocumentExtractor);

 

Is the *this* used in a multithreaded context, or is it only single threaded?

 

Are you getting any exceptions or logging?

 

Are you using the standard AutoDetectParser or the RecursiveParserWrapper?

 

Is parsing working on other files and other embedded files?

 

Are you able to share your Dockerfile or a minimal project that shows this 
problem?
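
To make the multithreading question above concrete, here is a minimal sketch of the "one context per parse call" pattern. Everything in it is hypothetical: `SketchContext` is a stand-in for a context object backed by a plain, unsynchronized map (it is not Tika's ParseContext), `SketchExtractor` stands in for a custom extractor, and no real parsing happens. It only illustrates that when each call builds its own context, threads never share mutable state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerCallContextSketch {

    /** Hypothetical stand-in for a parse context: an unsynchronized map keyed by class name. */
    static final class SketchContext {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key.getName()));
        }
    }

    /** Hypothetical extractor that remembers which document it was created for. */
    static final class SketchExtractor {
        final String owner;

        SketchExtractor(String owner) {
            this.owner = owner;
        }
    }

    /** Builds a fresh context for every call, so nothing is shared between threads. */
    static String parseWithOwnContext(String documentName) {
        SketchContext context = new SketchContext();
        context.set(SketchExtractor.class, new SketchExtractor(documentName));
        // A real parse would run here; the sketch only reads the extractor back.
        return context.get(SketchExtractor.class).owner;
    }

    /** Runs one task per document on a fixed pool and collects results in input order. */
    static List<String> parseAll(List<String> names, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : names) {
                Callable<String> task = () -> parseWithOwnContext(name);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAll(List.of("a.docx", "b.doc"), 2));
    }
}
```

If the context were instead a single field reused and mutated across threads, each call could see an extractor set by another call, which is the situation the question is trying to rule out.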


was (Author: [email protected]):
To confirm: if you set anything in a ParseContext, is content from MS Office 
documents (.doc, .docx, etc.) not extracted, whether or not those documents are 
embedded in another file?  Or are you saying that files embedded within 
.doc/.docx/etc. files are not extracted?

 

If you don't use your custom tika-config.xml, do you see the same behavior?

 

>this.context.set(EmbeddedDocumentExtractor.class, 
>nalyticsEmbeddedDocumentExtractor);

 

Is the *this* used in a multithreaded context, or is it only single threaded?

 

Are you getting any exceptions or logging?

 

Are you able to share your Dockerfile or a minimal project that shows this 
problem?

> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, depedency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = new 
> NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class, 
> nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years. 
> This also works under Tika 2.2.0 when running in development environments 
> (Eclipse, Apache Tomcat). However, when running under Docker, the text 
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
> Docker, the Microsoft documents are fully parsed, so this problem was 
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via 
> context.set the same problem occurs. Also, if the standard Tika Embedded 
> Document Extractor is used the same problem occurs. Our Docker image contains 
> our application's code which uses Tika, as well as Apache DS. The problem 
> occurs running Docker on Ubuntu, Mac OS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
