[ 
https://issues.apache.org/jira/browse/TIKA-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485409#comment-17485409
 ] 

Tim Allison edited comment on TIKA-3657 at 2/1/22, 6:33 PM:
------------------------------------------------------------

I physically removed a detector from the jar/war, hoping that would prevent the 
classes after it from loading, and it doesn't.  I configured a misspelled 
detector with the same hope, and that doesn't prevent it either.

If I set a value > Integer.MAX_VALUE in the config file, I get something that 
is not a silent failure of class loading:

{noformat}
Root Cause
java.lang.NumberFormatException: For input string: "13423423424322217728"
        java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        java.base/java.lang.Integer.parseInt(Integer.java:652)
        java.base/java.lang.Integer.<init>(Integer.java:1105)
        java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        org.apache.tika.config.Param.getTypedValue(Param.java:282)
        org.apache.tika.config.Param.load(Param.java:188)
        org.apache.tika.config.TikaConfig$XmlLoader.getParams(TikaConfig.java:793)
        org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:682)
        org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:621)
        org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:155)
        org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:141)
        org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:133)
        org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
        MyServlet.doPut(MyServlet.java:47)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:684)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:764)
        org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
{noformat}
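
The frames in the trace suggest the configured string is turned into an Integer 
reflectively, through the Integer(String) constructor, which would explain why 
an out-of-range number fails loudly rather than silently. Below is a minimal 
Java sketch of that behaviour using only the JDK. It is not Tika's code: the 
names TypedValueSketch and typedValue are hypothetical, and unwrapping the 
InvocationTargetException is an assumption made for illustration.

```java
import java.lang.reflect.InvocationTargetException;

class TypedValueSketch {

    /**
     * Builds a value of the given type from its string form by calling the
     * type's public single-String constructor reflectively. Hypothetical
     * helper; it only illustrates the mechanism visible in the stack trace.
     */
    static <T> T typedValue(Class<T> type, String raw) {
        try {
            return type.getConstructor(String.class).newInstance(raw);
        } catch (InvocationTargetException e) {
            // Reflection wraps whatever the constructor threw; surface the
            // real cause (here a NumberFormatException) to the caller.
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            throw new IllegalStateException(cause);
        } catch (ReflectiveOperationException e) {
            // No such constructor, or it is not accessible.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Largest value that still fits in an int.
        System.out.println(typedValue(Integer.class, "2147483647"));
        try {
            // Same input string as in the trace: far beyond Integer.MAX_VALUE.
            typedValue(Integer.class, "13423423424322217728");
        } catch (NumberFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that a bad parameter value surfaces as an 
exception at config-load time, in contrast to the silent class-loading 
behaviour under investigation.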

If I misspell the DefaultZipContainerDetector, it simply doesn't load, but the 
file is parsed by the PackageParser and then the XML parser, so there's still a 
bunch of content.
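
Because a misspelled or missing class is skipped without an error, one way to 
make the condition visible while debugging is to check, before building the 
config, whether each configured class name resolves on the runtime classpath. 
The following is a hedged, JDK-only sketch. ClassPresenceCheck and missing are 
hypothetical names and are not part of Tika; the class names to check are 
whatever appears in your own tika-config.xml.

```java
import java.util.ArrayList;
import java.util.List;

class ClassPresenceCheck {

    /** Returns the names that cannot be resolved on the current classpath. */
    static List<String> missing(List<String> classNames) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassPresenceCheck.class.getClassLoader();
        }
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: resolve the class without running its
                // static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                // Not on the classpath, or present but unloadable because one
                // of its own dependencies is missing.
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Pass fully qualified class names as command-line arguments.
        for (String name : missing(List.of(args))) {
            System.out.println("not loadable: " + name);
        }
    }
}
```

Running the same check inside and outside the Docker image would show whether 
the two environments see different classes.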



> Microsoft documents are not text parsed when running under Docker
> -----------------------------------------------------------------
>
>                 Key: TIKA-3657
>                 URL: https://issues.apache.org/jira/browse/TIKA-3657
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, depedency
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Tim Barrett
>            Priority: Major
>             Fix For: 2.2.2
>
>         Attachments: scenario traces.txt, tika-config.xml
>
>
> We use EmbeddedDocumentExtractor, with this code:
> NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor =
>     new NalyticsEmbeddedDocumentExtractor(this);
> this.context.set(EmbeddedDocumentExtractor.class,
>     nalyticsEmbeddedDocumentExtractor);
> This all works fine for us, and has been used in production for a few years. 
> This also works under Tika 2.2.0 when running in development environments 
> (Eclipse, Apache Tomcat). However, when running under Docker, the text 
> within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under 
> Docker, the Microsoft documents are fully parsed, so this problem was 
> introduced in 2.2.0.
> Interestingly, I found that if *anything at all* is added to the context via 
> context.set the same problem occurs. Also, if the standard Tika Embedded 
> Document Extractor is used the same problem occurs. Our Docker image contains 
> our application's code which uses Tika, as well as Apache DS. The problem 
> occurs running Docker on Ubuntu, Mac OS and Windows.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
