[jira] [Commented] (TIKA-3657) Microsoft documents are not text parsed when running under Docker

Tim Barrett (Jira) Wed, 02 Feb 2022 02:40:04 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485696#comment-17485696 ]


Tim Barrett commented on TIKA-3657:
-----------------------------------

We are getting much closer now.

If I use a Tika config without the mark limit, content parsing completes, 
including under Docker:

<properties>
   <detectors>
       <detector class="org.apache.tika.detect.OverrideDetector"/>
       <detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
       <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
       <detector class="org.gagravarr.tika.OggDetector"/>
       <detector class="org.apache.tika.mime.MimeTypes"/>
   </detectors>
</properties>

WITH the markLimit, it fails:

<properties>
   <detectors>
       <detector class="org.apache.tika.detect.OverrideDetector"/>
       <detector class="org.gagravarr.tika.OggDetector"/>
       <detector class="org.apache.tika.detect.apple.BPListDetector"/>
       <detector class="org.apache.tika.detect.microsoft.POIFSContainerDetector">
           <params>
               <param name="markLimit" type="int">134217728</param>
           </params>
       </detector>
       <detector class="org.apache.tika.detect.ole.MiscOLEDetector"/>
       <detector class="org.apache.tika.detect.zip.DefaultZipContainerDetector"/>
       <!-- <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
       <detector class="org.gagravarr.tika.OggDetector"/> -->
       <detector class="org.apache.tika.mime.MimeTypes"/>
   </detectors>
</properties>

As I mentioned yesterday, use of the config could be avoided entirely (in this 
case) if the markLimit on the POIFS detector (or on all detectors) could be set 
to 128 * 1024 * 1024, as in the attached POIFSContainerDetector.java.
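
For anyone following along, here is a minimal sketch of the general 
mark/reset behaviour that a mark limit bounds. It uses only java.io and is 
not Tika's detector code; the class name MarkLimitSketch, the helper 
canRewind and the 16-byte buffer are made up for illustration. It also 
confirms that 128 * 1024 * 1024 is the 134217728 used in the config above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration only; this is not part of Tika.
final class MarkLimitSketch {

    // Same value as the markLimit param in the config above.
    static final int MARK_LIMIT = 128 * 1024 * 1024;

    /**
     * Marks an in-memory stream with the given limit, reads bytesToRead bytes
     * one at a time, then tries to reset. Returns true if the stream could be
     * rewound to the mark, false if the mark had been invalidated.
     */
    static boolean canRewind(int streamSize, int markLimit, int bytesToRead) {
        // A deliberately tiny 16-byte internal buffer, so the effect shows up
        // with small inputs.
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[streamSize]), 16);
        try {
            in.mark(markLimit);
            for (int i = 0; i < bytesToRead; i++) {
                if (in.read() == -1) {
                    break; // end of stream
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            return false; // read past the mark limit, so the mark is gone
        }
    }

    public static void main(String[] args) {
        System.out.println(MARK_LIMIT);                 // 134217728
        System.out.println(canRewind(1024, 16, 512));   // false
        System.out.println(canRewind(1024, 1024, 512)); // true
    }
}
```

The point of the sketch is only that code which reads further than the mark 
limit can no longer rewind the stream. Whether that is exactly what goes 
wrong in the failing config is an assumption on my part.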

> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, dependency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: POIFSContainerDetector.java, scenario traces.txt, 
> tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = 
>     new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class, 
>     nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years. 
> This also works under Tika 2.2.0 when running in development environments 
> (Eclipse, Apache Tomcat). However, when running under Docker the text 
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
> Docker, the Microsoft documents are fully parsed, so this problem was 
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via 
> context.set the same problem occurs. Also, if the standard Tika Embedded 
> Document Extractor is used the same problem occurs. Our Docker image contains 
> our application's code which uses Tika, as well as Apache DS. The problem 
> occurs running Docker on Ubuntu, Mac OS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
