[ https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510788#comment-16510788 ]
Sebastian Nagel edited comment on NUTCH-2578 at 6/18/18 4:19 PM:
-----------------------------------------------------------------

Merged. I'll test this fix and also TIKA-2645 (using tika-core 1.19-SNAPSHOT) during the next two days. Thanks, [~yossi] and [~talli...@apache.org]!


was (Author: wastl-nagel):
Merged. I'll test this fix and also TIKA-2658 (using tika-core 1.19-SNAPSHOT) during the next two days. Thanks, [~yossi] and [~talli...@apache.org]!

> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>
>                 Key: NUTCH-2578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2578
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new
> MimeUtil object. That's not cheap, as it always creates a new Tika object and
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
>         - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:240)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>         at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
>         at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
>         at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
>         at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
>         at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
>         at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
>         at java.util.Collections.list(Collections.java:5239)
>         at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
>         at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
>         at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
>         at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
>         at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
>         at org.apache.tika.Tika.<init>(Tika.java:116)
>         at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
>         at org.apache.nutch.protocol.Content.<init>(Content.java:83)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads, this may cause a significant bottleneck.
> Running a Fetcher with 120 threads, I've found up to 50 threads waiting for
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
>     | grep -A25 'waiting to lock' \
>     | grep -F 'org.apache.tika.Tika.<init>' \
>     | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika
> detector|https://www.mail-archive.com/user@tika.apache.org/msg00296.html],
> the best solution seems to be to cache the MimeUtil object in the actual
> protocol implementation, as is done in Nutch 2.x ([lib-http HttpBase, line
> #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
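
The caching approach proposed in the issue description can be sketched roughly as follows. This is not the actual Nutch patch: MimeDetector, Content, ProtocolSketch and CachedDetectorDemo are hypothetical stand-ins (for MimeUtil, protocol.Content and a protocol implementation such as HttpBase), and the sketch only assumes what the issue states, namely that the helper is thread-safe and expensive to construct. The instance counter exists purely to make the effect visible.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive but thread-safe helper such as MimeUtil. */
final class MimeDetector {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    MimeDetector() {
        // In the real case, this is where config files would be read from the jar.
        INSTANCES.incrementAndGet();
    }

    /** Toy detection only: trusts the header and falls back to a default. */
    String detect(String contentTypeHeader) {
        return (contentTypeHeader == null || contentTypeHeader.isEmpty())
                ? "application/octet-stream"
                : contentTypeHeader;
    }
}

/** Hypothetical stand-in for protocol.Content: it receives the detector instead of creating one. */
final class Content {
    final String url;
    final String contentType;

    Content(String url, String contentTypeHeader, MimeDetector detector) {
        this.url = url;
        this.contentType = detector.detect(contentTypeHeader);
    }
}

/** Hypothetical stand-in for a protocol implementation shared by all fetcher threads. */
final class ProtocolSketch {
    // Created once per protocol instance and reused for every Content object.
    private final MimeDetector detector = new MimeDetector();

    Content fetch(String url, String contentTypeHeader) {
        return new Content(url, contentTypeHeader, detector);
    }
}

final class CachedDetectorDemo {
    /** Fetches from several threads and returns how many detectors were constructed. */
    static int run(int threads, int fetchesPerThread) {
        MimeDetector.INSTANCES.set(0);
        ProtocolSketch protocol = new ProtocolSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < fetchesPerThread; i++) {
                    protocol.fetch("http://example.org/" + i, "text/html");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("fetch tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for fetch tasks", e);
        }
        return MimeDetector.INSTANCES.get();
    }

    public static void main(String[] args) {
        // Prints "detectors constructed: 1" regardless of thread or fetch count.
        System.out.println("detectors constructed: " + run(8, 1000));
    }
}
```

With the detector held by the protocol object, the number of constructions no longer grows with the number of fetches or threads, so the expensive setup path seen in the stack trace would be taken only once per protocol instance rather than once per fetched document.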