[ https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516005#comment-16516005 ]
Sebastian Nagel commented on NUTCH-2578:
----------------------------------------
The results with 8 randomly selected Java stack dumps for all Fetcher threads:
- with TIKA-2645, NUTCH-2578 and NUTCH-2579 fixed:
{noformat}
496 at java.net.SocketInputStream.socketRead0(Native Method)
121 at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
107 at java.net.PlainSocketImpl.socketConnect(Native Method)
53 at sun.misc.Unsafe.park(Native Method)
6 at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3799)
5 at java.util.regex.Pattern$CharProperty.match(Pattern.java:3778)
3 at java.lang.Character.codePointAt(Character.java:4866)
2 at sun.security.ec.ECDSASignature.verifySignedDigest(Native Method)
1 at java.net.URL.<init>(URL.java:540)
1 at java.util.regex.Pattern$Curly.match0(Pattern.java:4252)
1 at java.util.regex.Pattern$Curly.match0(Pattern.java:4262)
1 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1077)
1 at org.apache.tika.mime.MagicMatch.eval(MagicMatch.java:61)
1 at sun.nio.cs.UTF_8$Encoder.encodeLoop(UTF_8.java:691)
1 at sun.nio.cs.UTF_8.newEncoder(UTF_8.java:72)
{noformat}
Fetcher threads now spend their time on what they're supposed to do: reading
from or connecting to sockets and resolving host names to IPs. The "parking"
threads are waiting for the parsers (fetcher.parse == true).
- compared with the stack dump counts taken one month ago:
{noformat}
262 at java.net.SocketInputStream.socketRead0(Native Method)
158 at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
136 at java.net.PlainSocketImpl.socketConnect(Native Method)
124 at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
38 at sun.misc.Unsafe.park(Native Method)
27 at java.util.regex.Pattern$CharProperty.match(Pattern.java:3778)
21 at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3799)
10 at sun.misc.URLClassPath.getLookupCache(URLClassPath.java:396)
7 at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
7 at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
5 at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
4 at sun.misc.Unsafe.unpark(Native Method)
4 at sun.security.ec.ECDHKeyAgreement.deriveKey(Native Method)
3 at java.util.regex.Pattern$Curly.match0(Pattern.java:4262)
3 at java.util.zip.ZipFile.getEntry(Native Method)
2 at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
2 at java.net.URI$Parser.scan(URI.java:2998)
2 at java.util.Arrays.copyOf(Arrays.java:3236)
2 at java.util.HashMap.hash(HashMap.java:339)
2 at java.util.regex.Pattern$Curly.match0(Pattern.java:4270)
2 at java.util.zip.ZipCoder.getBytes(ZipCoder.java:80)
2 at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:469)
1 at java.io.UnixFileSystem.getLastModifiedTime(Native Method)
1 at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
1 at java.lang.ClassLoader.findLoadedClass0(Native Method)
1 at java.lang.Object.hashCode(Native Method)
1 at java.lang.String.toCharArray(String.java:2899)
1 at java.lang.String.toLowerCase(String.java:2586)
1 at java.lang.Thread.setPriority0(Native Method)
1 at java.math.BigInteger.oddModPow(BigInteger.java:2840)
1 at java.net.URL.<init>(URL.java:540)
1 at java.net.URL.<init>(URL.java:599)
1 at java.util.LinkedHashMap.keySet(LinkedHashMap.java:533)
1 at java.util.regex.Pattern$Curly.match(Pattern.java:4229)
1 at java.util.zip.Inflater.init(Native Method)
1 at org.apache.nutch.protocol.okhttp.OkHttpResponse.toByteArray(OkHttpResponse.java:158)
1 at org.apache.nutch.util.PrefixStringMatcher.shortestMatch(PrefixStringMatcher.java:78)
1 at org.apache.tika.detect.MagicDetector.detect(MagicDetector.java:374)
1 at sun.security.ec.ECDSASignature.verifySignedDigest(Native Method)
{noformat}
Both counts were produced using a recent Nutch master and protocol-okhttp
(NUTCH-2576). Only the "deepest" lines of the thread stacks are counted, using
the command:
{{jstack <pid_of_fetcher_task> | grep -h -A2 '^"FetcherThread"' | grep -v '^"FetcherThread"' | grep -v 'State:' | sort | uniq -c | sort -k1,1nr}}
Of course, 8 stack dumps are not super-representative, but it looks like all
locks on the Nutch job file are fixed now.
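For anyone who prefers not to chain grep/sort/uniq, the counting of "deepest" frames described above could be sketched in Java roughly as follows. This is only an illustration, not part of Nutch: the class name DeepestFrameCounter and its method are made up for this example, and it assumes the usual jstack layout (thread header line, "State:" line, then "at ..." frames, blank line between threads).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch: counts the top ("deepest") stack
// frame of every thread in a jstack dump whose header starts with a given name.
class DeepestFrameCounter {

    static Map<String, Integer> countDeepestFrames(List<String> jstackLines, String threadName) {
        Map<String, Integer> counts = new HashMap<>();
        String header = "\"" + threadName + "\"";
        boolean waitingForFirstFrame = false;
        for (String line : jstackLines) {
            if (line.startsWith(header)) {
                // Header of a matching thread: the next "at ..." line is its deepest frame.
                waitingForFirstFrame = true;
                continue;
            }
            if (!waitingForFirstFrame) {
                continue;
            }
            String trimmed = line.trim();
            if (trimmed.startsWith("at ")) {
                counts.merge(trimmed, 1, Integer::sum);
                waitingForFirstFrame = false;
            } else if (trimmed.isEmpty()) {
                // Blank line ends the thread block (thread without any frames).
                waitingForFirstFrame = false;
            }
            // Anything else (e.g. the "java.lang.Thread.State:" line) is skipped.
        }
        return counts;
    }

    // Reads a jstack dump from stdin and prints the counts, highest first.
    public static void main(String[] args) throws IOException {
        List<String> lines;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            lines = in.lines().collect(Collectors.toList());
        }
        countDeepestFrames(lines, "FetcherThread").entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .forEach(e -> System.out.printf("%7d %s%n", e.getValue(), e.getKey()));
    }
}
```

Feeding it the concatenated output of several jstack runs should give sums comparable to the tables above, as long as the dumps follow the assumed layout.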
> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>
> Key: NUTCH-2578
> URL: https://issues.apache.org/jira/browse/NUTCH-2578
> Project: Nutch
> Issue Type: Improvement
> Components: protocol
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new
> MimeUtil object. That's not cheap as it always creates a new Tika object and
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
> - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
> at java.util.jar.JarFile.getEntry(JarFile.java:240)
> at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
> at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
> at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
> at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
> at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
> at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
> at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
> at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
> at java.util.Collections.list(Collections.java:5239)
> at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
> at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
> at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
> at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
> at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
> at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
> at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
> at org.apache.tika.Tika.<init>(Tika.java:116)
> at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
> at org.apache.nutch.protocol.Content.<init>(Content.java:83)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads, this may cause a significant bottleneck:
> running a Fetcher with 120 threads, I've found up to 50 threads waiting for
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
> | grep -A25 'waiting to lock' \
> | grep -F 'org.apache.tika.Tika.<init>' \
> | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika
> detector|https://www.mail-archive.com/[email protected]/msg00296.html],
> the best solution seems to be to cache the MimeUtil object in the actual
> protocol implementation, as is done in Nutch 2.x ([lib-http HttpBase, line
> #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
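To illustrate the caching idea from the description, here is a minimal, self-contained Java sketch. It is not the actual Nutch patch: ExpensiveDetector, Content and Protocol are hypothetical stand-ins for MimeUtil, protocol.Content and a protocol implementation such as HttpBase, and the detection logic is a placeholder. The only point shown is that the costly object is built once per protocol instance and handed to every Content, instead of being created inside the Content constructor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: all class names are stand-ins, not Nutch or Tika classes.
class SharedDetectorSketch {

    // Stand-in for MimeUtil: costly to construct, assumed thread-safe to use.
    static final class ExpensiveDetector {
        static final AtomicInteger CREATED = new AtomicInteger();

        ExpensiveDetector() {
            // In the real case this is where config files are read from the job jar.
            CREATED.incrementAndGet();
        }

        String detect(String url) {
            // Placeholder logic only, to keep the example runnable.
            return url.endsWith(".html") ? "text/html" : "application/octet-stream";
        }
    }

    // Stand-in for protocol.Content: receives the detector instead of creating one.
    static final class Content {
        final String url;
        final String contentType;

        Content(String url, ExpensiveDetector detector) {
            this.url = url;
            this.contentType = detector.detect(url);
        }
    }

    // Stand-in for a protocol implementation: creates the detector once and reuses it.
    static final class Protocol {
        private final ExpensiveDetector detector = new ExpensiveDetector();

        Content fetch(String url) {
            return new Content(url, detector);
        }
    }

    // Runs fetcher-like threads against one Protocol and returns how many
    // detectors were constructed while they ran.
    static int detectorsCreatedBy(int threads, int fetchesPerThread) {
        int before = ExpensiveDetector.CREATED.get();
        Protocol protocol = new Protocol();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                for (int j = 0; j < fetchesPerThread; j++) {
                    protocol.fetch("http://example.org/page" + j + ".html");
                }
            }, "FetcherThread");
            workers.add(t);
            t.start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return ExpensiveDetector.CREATED.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("detectors created: " + detectorsCreatedBy(120, 50));
    }
}
```

Sharing one instance is only safe because the shared object is thread-safe, which is the precondition stated above for MimeUtil and the Tika detector it calls.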