[
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479158#comment-16479158
]
Sebastian Nagel commented on NUTCH-2578:
----------------------------------------
Hi [~yossi], the ObjectCache only prevents configured objects from being created
and configured multiple times. It does not make anything thread-safe, or am I
missing something? Of course, if the MimeUtil itself is cached, it makes less
sense to use the object cache (see NUTCH-618) for the MimeTypes object. It does
no harm, however. Regarding thread-safety: yes, we should make sure that the
MimeUtil methods are thread-safe. If Tika.detect() is thread-safe (I'll ask the
Tika people whether it really is), I would just document this as a requirement.
Does this make sense?
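To illustrate the distinction between caching and thread-safety, here is a minimal sketch. The names SimpleObjectCache and ExpensiveDetector are hypothetical stand-ins, not Nutch's ObjectCache or Tika's detector. The cache guarantees that the object is constructed once, but every thread then shares that single instance, so the instance's own methods still have to be thread-safe.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SharedInstanceSketch {

  /** Hypothetical stand-in for an object that is expensive to construct. */
  static final class ExpensiveDetector {
    ExpensiveDetector(AtomicInteger constructed) {
      constructed.incrementAndGet(); // record every construction
    }

    // Stateless, hence safe to share between threads. The cache does NOT
    // guarantee this property, the cached class has to provide it.
    String detect(String name) {
      return name.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
  }

  /** Hypothetical minimal cache (an assumption, not Nutch's ObjectCache API). */
  static final class SimpleObjectCache {
    private final ConcurrentMap<String, Object> objects = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T getOrCreate(String key, Supplier<T> factory) {
      // computeIfAbsent runs the factory at most once per key
      return (T) objects.computeIfAbsent(key, k -> factory.get());
    }
  }

  /** Starts the given number of threads and returns how many detectors were built. */
  static int countConstructions(int threads) {
    AtomicInteger constructed = new AtomicInteger();
    SimpleObjectCache cache = new SimpleObjectCache();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        ExpensiveDetector detector = cache.getOrCreate("detector",
            () -> new ExpensiveDetector(constructed));
        detector.detect("index.html"); // all threads call the same instance
      });
      workers[i].start();
    }
    try {
      for (Thread worker : workers) {
        worker.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return constructed.get();
  }

  public static void main(String[] args) {
    System.out.println("constructions: " + countConstructions(16));
  }
}
```

If ExpensiveDetector kept mutable state without synchronization, the cache would still hand out one instance, and the race would simply move into detect().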
Meanwhile, I'm testing [a
fix|https://github.com/commoncrawl/nutch/commit/107a6a26acfe97fc61f3b589ae88e4228d181c83]
with 48 fetcher tasks, each running 120 threads. No issues so far. However,
there is another lock in the MimeTypes.detect() method on which up to 30 threads
are now waiting. It was there before, but with at most 4 threads waiting. I'll
have to look into it as well.
{noformat}
"FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f21eebd1800 nid=0x5cc4 waiting for monitor entry [0x00007f21b3f45000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
- waiting to lock <0x00000005e03245b8> (a java.util.jar.JarFile)
at java.util.jar.JarFile.getEntry(JarFile.java:240)
at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
at sun.misc.URLClassPath.findResource(URLClassPath.java:215)
at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
at java.lang.ClassLoader.getResource(ClassLoader.java:1096)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.<init>(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
at org.apache.xerces.jaxp.SAXParserFactoryImpl.newSAXParser(Unknown Source)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:62)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:494)
at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:193)
at org.apache.nutch.protocol.Content.getContentType(Content.java:310)
at org.apache.nutch.protocol.Content.<init>(Content.java:107)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:321)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
{noformat}
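The trace above shows a new SAX parser being constructed inside detect(), which triggers a service-provider lookup in the jar and hence the JarFile lock. One possible way to avoid repeating that lookup, sketched here only as an illustration and not as what Tika or the tested fix does, is to keep one parser per thread and reuse it. The class name PerThreadSaxParserSketch and the method rootElement are hypothetical, and hardening against external entities is omitted for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PerThreadSaxParserSketch {

  // The factory lookup happens once, when the class is initialized.
  private static final SAXParserFactory FACTORY = createFactory();

  // Each thread builds its parser once and reuses it afterwards.
  private static final ThreadLocal<SAXParser> PARSER = ThreadLocal.withInitial(() -> {
    try {
      synchronized (FACTORY) { // factories are not guaranteed to be thread-safe
        return FACTORY.newSAXParser();
      }
    } catch (Exception e) {
      throw new IllegalStateException("cannot create SAX parser", e);
    }
  });

  private static SAXParserFactory createFactory() {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory;
  }

  /** Remembers the first element seen, then aborts the parse. */
  private static final class RootHandler extends DefaultHandler {
    String root;

    @Override
    public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
      root = localName;
      throw new SAXException("root element found, stop parsing");
    }
  }

  /** Returns the local name of the root element, or null if the input is not XML. */
  static String rootElement(String xml) {
    SAXParser parser = PARSER.get();
    parser.reset(); // restore the parser to its initial configuration before reuse
    RootHandler handler = new RootHandler();
    try {
      parser.parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (SAXException | IOException e) {
      // Expected: either our own abort or a genuine parse error.
    }
    return handler.root;
  }

  public static void main(String[] args) {
    System.out.println(rootElement("<html><body/></html>"));
  }
}
```

The trade-off is one parser instance held per fetcher thread for the lifetime of the thread; a bounded pool would be an alternative if memory per thread matters.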
> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>
> Key: NUTCH-2578
> URL: https://issues.apache.org/jira/browse/NUTCH-2578
> Project: Nutch
> Issue Type: Improvement
> Components: protocol
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new
> MimeUtil object. That's not cheap as it always creates a new tika.MimeTypes
> object and there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
> - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
> at java.util.jar.JarFile.getEntry(JarFile.java:240)
> at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
> at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
> at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
> at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
> at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
> at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
> at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
> at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
> at java.util.Collections.list(Collections.java:5239)
> at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
> at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
> at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
> at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
> at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
> at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
> at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
> at org.apache.tika.Tika.<init>(Tika.java:116)
> at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
> at org.apache.nutch.protocol.Content.<init>(Content.java:83)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads, this may cause a significant bottleneck.
> Running a Fetcher with 120 threads, I've found up to 50 threads waiting for
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
> | grep -A25 'waiting to lock' \
> | grep -F 'org.apache.tika.Tika.<init>' \
> | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika
> detector|https://www.mail-archive.com/[email protected]/msg00296.html],
> the best solution seems to be to cache the MimeUtil object in the actual
> protocol implementation, as is done in Nutch 2.x ([lib-http HttpBase, line
> #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).