[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479158#comment-16479158
 ] 

Sebastian Nagel commented on NUTCH-2578:
----------------------------------------

Hi [~yossi], the ObjectCache only ensures that configured objects are not
created and configured multiple times. It does not make anything thread-safe,
or am I missing something? Of course, if the MimeUtil itself is cached, it makes
less sense to use the object cache (see NUTCH-618) for the MimeTypes object. It
does no harm, however. Regarding thread-safety: yes, we should make sure that
the MimeUtil methods are thread-safe. If Tika.detect() is thread-safe (I'll ask
the Tika people whether it really is), I would just document this as a
requirement. Does this make sense?
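
To illustrate the distinction drawn above, here is a minimal sketch. It assumes a hypothetical `SimpleObjectCache` (not the actual Nutch ObjectCache API) and a stand-in `ExpensiveDetector` in place of MimeTypes. The cache only guarantees that the object is built once per key; whether the cached object may be called from many threads at once remains a property of that object.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for an object that is expensive to build and configure.
class ExpensiveDetector {
    static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    ExpensiveDetector() {
        CONSTRUCTED.incrementAndGet(); // count how often the setup cost is paid
    }

    // Stateless, hence safe to call concurrently. The cache does not make it so.
    String detect(String name) {
        return name.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
}

// Hypothetical cache for illustration only; not the Nutch ObjectCache API.
class SimpleObjectCache {
    private final Map<String, Object> objects = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T getOrCreate(String key, Supplier<T> factory) {
        // computeIfAbsent runs the factory at most once per key.
        return (T) objects.computeIfAbsent(key, k -> factory.get());
    }
}

class ObjectCacheSketch {
    public static void main(String[] args) throws InterruptedException {
        SimpleObjectCache cache = new SimpleObjectCache();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                ExpensiveDetector d =
                    cache.getOrCreate("detector", ExpensiveDetector::new);
                d.detect("index.html"); // concurrent use relies on detect() itself
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("constructed: " + ExpensiveDetector.CONSTRUCTED.get());
    }
}
```

If `detect()` mutated shared state without synchronization, the cache would still hand the same instance to every thread, which is why thread-safety has to be checked or documented separately.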

Meanwhile, I'm testing
[a fix|https://github.com/commoncrawl/nutch/commit/107a6a26acfe97fc61f3b589ae88e4228d181c83]
with 48 fetcher tasks, each running 120 threads. No issues so far. However,
there is another lock in the MimeTypes.detect() method on which up to 30
threads are now waiting. It was there before, but with at most 4 threads
waiting. I'll have to look into it as well.
{noformat}
"FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f21eebd1800 nid=0x5cc4 waiting for monitor entry [0x00007f21b3f45000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
        - waiting to lock <0x00000005e03245b8> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:240)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
        at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
        at sun.misc.URLClassPath.findResource(URLClassPath.java:215)
        at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
        at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
        at java.lang.ClassLoader.getResource(ClassLoader.java:1096)
        at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
        at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserFactoryImpl.newSAXParser(Unknown Source)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:62)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
        at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
        at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:494)
        at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:193)
        at org.apache.nutch.protocol.Content.getContentType(Content.java:310)
        at org.apache.nutch.protocol.Content.<init>(Content.java:107)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:321)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
{noformat}

> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>
>                 Key: NUTCH-2578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2578
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new 
> MimeUtil object. That's not cheap as it always creates a new tika.MimeTypes 
> object and there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
>         - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:240)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>         at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
>         at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
>         at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
>         at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
>         at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
>         at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
>         at java.util.Collections.list(Collections.java:5239)
>         at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
>         at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
>         at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
>         at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
>         at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
>         at org.apache.tika.Tika.<init>(Tika.java:116)
>         at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
>         at org.apache.nutch.protocol.Content.<init>(Content.java:83)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads, this may cause a significant bottleneck.
> Running a Fetcher with 120 threads, I've found up to 50 threads waiting for
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
>       | grep -A25 'waiting to lock' \
>       | grep -F 'org.apache.tika.Tika.<init>' \
>       | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe
> [including the called Tika detector|https://www.mail-archive.com/[email protected]/msg00296.html],
> the best solution seems to be to cache the MimeUtil object in the actual
> protocol implementation, as is done in Nutch 2.x
> ([lib-http HttpBase, line #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
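
The caching approach proposed in the issue description can be sketched roughly as follows. All class names here (`SharedMimeDetector`, `FetchedContent`, `CachingProtocol`) are hypothetical stand-ins and not the actual Nutch or Tika classes. The sketch assumes the detector is thread-safe: the protocol object builds it once, and every content object receives the shared instance instead of constructing its own.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for MimeUtil; assumed to be thread-safe.
class SharedMimeDetector {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    SharedMimeDetector() {
        INSTANCES.incrementAndGet(); // the expensive setup would happen here
    }

    String detect(String url) {
        return url.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
}

// Hypothetical stand-in for protocol.Content: it takes the detector as an
// argument rather than instantiating a new one in its constructor.
class FetchedContent {
    final String url;
    final String contentType;

    FetchedContent(String url, SharedMimeDetector detector) {
        this.url = url;
        this.contentType = detector.detect(url);
    }
}

// Hypothetical protocol implementation that holds the single shared detector.
class CachingProtocol {
    private final SharedMimeDetector mimeTypes = new SharedMimeDetector();

    FetchedContent getProtocolOutput(String url) {
        return new FetchedContent(url, mimeTypes);
    }
}

class CachingProtocolSketch {
    public static void main(String[] args) throws InterruptedException {
        CachingProtocol protocol = new CachingProtocol();
        Thread[] fetchers = new Thread[16];
        for (int i = 0; i < fetchers.length; i++) {
            final String url = "http://example.org/page" + i + ".html";
            fetchers[i] = new Thread(() -> protocol.getProtocolOutput(url));
            fetchers[i].start();
        }
        for (Thread t : fetchers) {
            t.join();
        }
        System.out.println("detector instances: " + SharedMimeDetector.INSTANCES.get());
    }
}
```

With this layout the per-fetch constructor no longer touches the class path, so the fetcher threads do not queue on the jar file lock for detector setup.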



