[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499137#comment-16499137
 ] 

ASF GitHub Bot commented on NUTCH-2578:
---------------------------------------

sebastian-nagel opened a new pull request #338: NUTCH-2578 Avoid lock by 
MimeUtil in constructor of protocol.Content
URL: https://github.com/apache/nutch/pull/338
 
 
   - avoid creation of new Tika object in constructor of MimeUtil (cache in 
ObjectCache) - suggested by @YossiTamari. Thanks!
   - also hold MimeUtil object in protocol implementation (protocol plugin) and 
pass it to constructor of Content. This avoids any locks (also those by 
ObjectCache) in a highly concurrent fetcher. Note that the current MimeUtil 
implementation and the Tika detector are thread-safe.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>
>                 Key: NUTCH-2578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2578
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new 
> MimeUtil object. That's not cheap as it always creates a new Tika object and 
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 
> nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
>         - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:240)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>         at 
> sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
>         at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
>         at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
>         at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
>         at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
>         at 
> sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
>         at java.util.Collections.list(Collections.java:5239)
>         at 
> org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
>         at 
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
>         at 
> org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
>         at 
> org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
>         at 
> org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
>         at 
> org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
>         at org.apache.tika.Tika.<init>(Tika.java:116)
>         at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
>         at org.apache.nutch.protocol.Content.<init>(Content.java:83)
>         at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads this may cause a significant bottleneck, 
> running a Fetcher with 120 threads I've found up to 50 threads waiting for 
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
>       | grep -A25 'waiting to lock' \
>       | grep -F 'org.apache.tika.Tika.<init>' \
>       | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika 
> detector|https://www.mail-archive.com/[email protected]/msg00296.html], 
> the best solution seems to cache the MimeUtil object in the actual protocol 
> implementation as it is done in Nutch 2.x ([lib-http HttpBase, line 
> #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to