[ https://issues.apache.org/jira/browse/TIKA-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated TIKA-1568:
------------------------------------
    Description: 
Parsing many text files is slow because of repeated calls to 
ServiceLoader.loadServiceProviders(EncodingDetector.class). This happens in 
TXTParser, HTMLParser and SourceCodeParser. In most cases, when Tika is using 
the default ServiceLoader instance created in the Parser's static section, this 
cost can be avoided by caching the resulting List<EncodingDetector> at a 
higher level in the Parser (as a static property). When custom ServiceLoaders 
are used, the same can be achieved by putting this list in ParsingContext, or 
by caching these lists at a lower level in the ServiceLoader component.
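
The caching idea described above could be sketched roughly as follows. This is 
only an illustration and not Tika code: ProviderListCache and its 
expensiveLookup function are hypothetical names, and the lookup function stands 
in for whatever call actually scans the classpath (here, 
ServiceLoader.loadServiceProviders). It assumes the provider list does not 
change for the lifetime of the cache.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Tika): remembers the provider list for
 * each service interface so the expensive lookup runs once per interface.
 */
final class ProviderListCache {
    // One cached, read-only provider list per service interface.
    private final ConcurrentMap<Class<?>, List<?>> cache = new ConcurrentHashMap<>();

    // Assumed stand-in for the costly classpath scan.
    private final Function<Class<?>, List<?>> expensiveLookup;

    ProviderListCache(Function<Class<?>, List<?>> expensiveLookup) {
        this.expensiveLookup = expensiveLookup;
    }

    /** Returns the cached list, running the lookup only on the first request. */
    @SuppressWarnings("unchecked")
    <T> List<T> get(Class<T> iface) {
        // computeIfAbsent is atomic per key, so concurrent parse threads
        // asking for the same interface trigger a single lookup.
        return (List<T>) cache.computeIfAbsent(
                iface, k -> Collections.unmodifiableList(expensiveLookup.apply(k)));
    }
}
```

Held in a static field, such a cache would correspond to the "static property" 
option. Keeping one instance per ServiceLoader would correspond to caching at 
the lower level. Either way the cached list is shared between threads, so the 
providers in it would have to be safe for concurrent use.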

Relevant part of the stack trace follows:
{code}
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
        - locked <0x00000007909d2e48> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:227)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
        at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
        at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
        at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
        at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
        at java.util.Collections.list(Collections.java:3687)
        at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
        at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
        at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
        at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
        at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
        at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
{code}

  was:
Performance of parsing many plain text files suffers from repeated calls to 
ServiceLoader.loadServiceProviders(EncodingDetector.class). In most cases, when 
Tika is using the default ServiceLoader instance created in TXTParser, this 
cost can be avoided by caching the resulting List<EncodingDetector> either at a 
higher level in TXTParser (e.g. by putting it in ParsingContext) or at a lower 
level in ServiceLoader.

Relevant part of  the stacktrace follows:
{code}
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
        - locked <0x00000007909d2e48> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:227)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
        at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
        at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
        at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
        at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
        at java.util.Collections.list(Collections.java:3687)
        at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
        at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
        at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
        at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
        at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
        at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
{code}

        Summary: AutoDetectReader performance problem (was: TXTParser performance problem)

> AutoDetectReader performance problem
> ------------------------------------
>
>                 Key: TIKA-1568
>                 URL: https://issues.apache.org/jira/browse/TIKA-1568
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Andrzej Bialecki 
>
> Parsing many text files is slow because of repeated calls to 
> ServiceLoader.loadServiceProviders(EncodingDetector.class). This happens in 
> TXTParser, HTMLParser and SourceCodeParser. In most cases, when Tika is using 
> the default ServiceLoader instance created in the Parser's static section, 
> this cost can be avoided by caching the resulting List<EncodingDetector> at 
> a higher level in the Parser (as a static property). When custom 
> ServiceLoaders are used, the same can be achieved by putting this list in 
> ParsingContext, or by caching these lists at a lower level in the 
> ServiceLoader component.
> Relevant part of the stack trace follows:
> {code}
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
>       - locked <0x00000007909d2e48> (a java.util.jar.JarFile)
>       at java.util.jar.JarFile.getEntry(JarFile.java:227)
>       at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
>       at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
>       at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
>       at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
>       at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
>       at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
>       at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
>       at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
>       at java.util.Collections.list(Collections.java:3687)
>       at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
>       at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
>       at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
>       at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
>       at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
>       at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
>       at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
>       at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
