Sebastian Nagel created NUTCH-3067:
--------------------------------------
Summary: Improve performance of FetchItemQueues if error state is
preserved
Key: NUTCH-3067
URL: https://issues.apache.org/jira/browse/NUTCH-3067
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.21
Attachments: Screenshot_20240905_101623_fetcher_tasks_many_queues.png,
fetcher.map.20240711113925.925750.flamegraph.html
In certain cases the error state of a fetch queue needs to be
preserved, even if the queue is (currently) empty, because there might
still be URLs in the fetcher input not yet read by the QueueFeeder,
see NUTCH-2947. Keeping the queue together with its state is necessary
- to skip queues together with all items queued now or to be queued
later by the QueueFeeder, if a queue exceeds the maximum configured
number of exceptions (NUTCH-769). This is mostly a performance feature,
but with implications for politeness because HTTP 403 Forbidden
(and similar responses) are also counted as "exceptions".
- to implement an exponential backoff which slows down the fetching from sites
responding with repeated "exceptions" (NUTCH-2946).
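
The backoff named in the second point can be illustrated with a minimal sketch. It is not the Nutch implementation: the class {{BackoffSketch}}, its method names, and the "double the delay per exception, up to a cap" formula are assumptions chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: per-queue exception counter with exponential backoff. */
class BackoffSketch {
  private final long baseDelayMillis; // assumed delay after the first exception
  private final long maxDelayMillis;  // assumed upper bound for the delay
  private final Map<String, Integer> exceptionCounts = new HashMap<>();

  BackoffSketch(long baseDelayMillis, long maxDelayMillis) {
    this.baseDelayMillis = baseDelayMillis;
    this.maxDelayMillis = maxDelayMillis;
  }

  /** Counts one more exception for the queue and returns the delay to apply. */
  synchronized long recordException(String queueId) {
    int count = exceptionCounts.merge(queueId, 1, Integer::sum);
    return delayFor(count);
  }

  /** base, 2*base, 4*base, ... capped at maxDelayMillis (arguments assumed >= 0). */
  long delayFor(int exceptionCount) {
    if (exceptionCount <= 0) {
      return 0L;
    }
    long factor = 1L << Math.min(exceptionCount - 1, 62);
    // Compare before multiplying so that the product cannot overflow.
    if (baseDelayMillis > maxDelayMillis / factor) {
      return maxDelayMillis;
    }
    return baseDelayMillis * factor;
  }

  public static void main(String[] args) {
    BackoffSketch backoff = new BackoffSketch(1000L, 60_000L);
    for (int i = 0; i < 8; i++) {
      System.out.println(backoff.recordException("example.org"));
    }
  }
}
```

Note that the counter map is exactly the kind of state that has to outlive an empty queue: if the entry is dropped, the next URL of the same host starts again without any delay.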
However, there is a drawback when all "stateful" queues are preserved
until the QueueFeeder has finished reading input fetch lists: Nutch's
fetch queue implementation becomes slow if there are too many queues.
This situation / issue was observed in the first cycle of a crawl
where only the homepages of millions of sites were fetched:
- about 1 million homepages per fetcher task
- about 25% of the homepage URLs caused exceptions (the fetch lists were not
filtered beforehand for whether a site is reachable and responding)
- consequently, after a certain amount of time (3-4 hours) 250k queues per task
were "stateful" and preserved until the fetch list input was entirely read by
the QueueFeeder
- with too many queues, most of them empty (no URLs), the operations on the
queues become slow and fetching almost stalls (see screenshot)
- many queues but few URLs queued (250k vs. 25)
- most fetcher threads (190 out of 240) waiting for the lock on one of the
synchronized methods of FetchItemQueues
- also the QueueFeeder is affected by the lock which explains why only few
URLs are queued
Important notes: this is not an issue
- if no error state is preserved, that is, if
{{fetcher.max.exceptions.per.queue == -1}} and
{{fetcher.exceptions.per.queue.delay == 0.0}}
- or if the crawl isn't too "broad" in terms of the number of different hosts
(domains or IPs, depending on {{fetcher.queue.mode}})
Possible solutions:
1. do not keep every stateful queue: drop queues which have a low exception
count after a configurable amount of time. If a second URL from the same
host/domain/IP is fetched after a considerably long time span (e.g. 30 minutes),
the effect on performance and politeness should be negligible.
2. review the implementation of FetchItemQueues and the locking (synchronized
methods)
3. at least, try to prioritize QueueFeeder, for example by a method which adds
multiple fetch items within one synchronized call
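
Solution 1 could be sketched roughly as follows. This is an illustration only, not a patch against FetchItemQueues: the class {{QueuePurgeSketch}}, the nested {{QueueState}}, and the parameters {{lowExceptionThreshold}} and {{maxIdleMillis}} are hypothetical names.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: drop idle, empty queues that have only a few exceptions. */
class QueuePurgeSketch {

  /** Minimal stand-in for the state that is preserved per fetch queue. */
  static final class QueueState {
    final int exceptionCount;
    final long lastAccessMillis;
    final int queuedItems;

    QueueState(int exceptionCount, long lastAccessMillis, int queuedItems) {
      this.exceptionCount = exceptionCount;
      this.lastAccessMillis = lastAccessMillis;
      this.queuedItems = queuedItems;
    }
  }

  private final Map<String, QueueState> queues = new ConcurrentHashMap<>();
  private final int lowExceptionThreshold; // assumed: "low exception count" limit
  private final long maxIdleMillis;        // assumed: configurable time span

  QueuePurgeSketch(int lowExceptionThreshold, long maxIdleMillis) {
    this.lowExceptionThreshold = lowExceptionThreshold;
    this.maxIdleMillis = maxIdleMillis;
  }

  void put(String queueId, QueueState state) {
    queues.put(queueId, state);
  }

  boolean contains(String queueId) {
    return queues.containsKey(queueId);
  }

  /** Removes empty, idle queues with a low exception count; returns how many. */
  int purge(long nowMillis) {
    int removed = 0;
    Iterator<Map.Entry<String, QueueState>> it = queues.entrySet().iterator();
    while (it.hasNext()) {
      QueueState s = it.next().getValue();
      boolean idle = nowMillis - s.lastAccessMillis >= maxIdleMillis;
      if (s.queuedItems == 0 && s.exceptionCount <= lowExceptionThreshold && idle) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    QueuePurgeSketch sketch = new QueuePurgeSketch(2, 30 * 60 * 1000L);
    sketch.put("a.example", new QueueState(1, 0L, 0));
    sketch.put("b.example", new QueueState(5, 0L, 0));
    System.out.println(sketch.purge(2_000_000L));
  }
}
```

Queues above the threshold are kept on purpose, so that skipping a queue after too many exceptions (NUTCH-769) still works for hosts that fail repeatedly.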
Details and data:
Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)
Counts of the top (deepest) line in the stack traces of all Fetcher threads:
{noformat}
120 at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
49 at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
21 at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
19 at java.net.PlainSocketImpl.socketConnect(java.base/Native Method)
18 at java.net.SocketInputStream.socketRead0(java.base/Native Method)
6 at java.lang.Object.wait(java.base/Native Method) # waiting for HTTP/2 stream
4 at java.lang.Thread.sleep(java.base/Native Method)
2 at java.net.Inet4AddressImpl.lookupAllHostAddr(java.base/Native Method)
1 at java.util.Collections$SynchronizedCollection.size(java.base/Collections.java:2017)
{noformat}
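
Counts like the ones above can be derived from a thread dump with a few lines of code. The following is a minimal sketch, assuming a jstack-style dump in which every thread entry starts with a line holding the quoted thread name and every frame is on a single line starting with "at"; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts the top stack frame of threads with a given name. */
class TopFrameCounter {

  static Map<String, Integer> countTopFrames(List<String> dumpLines, String threadName) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    String header = "\"" + threadName + "\"";
    boolean expectTopFrame = false;
    for (String raw : dumpLines) {
      String line = raw.trim();
      if (line.startsWith("\"")) {
        // A new thread entry begins: only count it if the name matches.
        expectTopFrame = line.startsWith(header);
      } else if (expectTopFrame && line.startsWith("at ")) {
        // The first frame after the header is the top (deepest) frame.
        counts.merge(line.substring(3), 1, Integer::sum);
        expectTopFrame = false;
      }
    }
    return counts;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: TopFrameCounter <thread-dump-file>");
      return;
    }
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    countTopFrames(lines, "FetcherThread")
        .forEach((frame, n) -> System.out.println(n + " at " + frame));
  }
}
```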
Full stack traces (three examples):
{noformat}
"FetcherThread" #38 daemon prio=5 os_prio=0 cpu=43743.17ms elapsed=15890.29s tid=0x0000752967fff800 nid=0x83a3c waiting for monitor entry [0x000075292fcf9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:301)

"FetcherThread" #72 daemon prio=5 os_prio=0 cpu=38381.67ms elapsed=15881.02s tid=0x000075292822d000 nid=0x83a91 waiting for monitor entry [0x0000752926cfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:338)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:489)

"FetcherThread" #43 daemon prio=5 os_prio=0 cpu=39112.96ms elapsed=15889.09s tid=0x0000752928361000 nid=0x83a41 waiting for monitor entry [0x000075292d65f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:345)
{noformat}
Stack of the blocked QueueFeeder:
{noformat}
"QueueFeeder" #31 daemon prio=5 os_prio=0 cpu=19415.88ms elapsed=15926.65s tid=0x000075296780c800 nid=0x83a30 waiting for monitor entry [0x000075292fff9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:142)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:136)
        at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:141)
{noformat}
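
The stack above shows the QueueFeeder competing for the FetchItemQueues monitor on every single addFetchItem call. Solution 3, adding multiple fetch items within one synchronized call, might look roughly like the sketch below. {{BatchQueuesSketch}}, {{Item}} and {{addAll}} are hypothetical and not part of the Nutch API; fetch items are reduced to (queue id, URL) pairs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: add a whole batch while holding the monitor only once. */
class BatchQueuesSketch {

  /** Simplified fetch item: the queue it belongs to and its URL. */
  static final class Item {
    final String queueId;
    final String url;

    Item(String queueId, String url) {
      this.queueId = queueId;
      this.url = url;
    }
  }

  private final Map<String, Deque<String>> queues = new HashMap<>();
  private int totalSize;

  /** One item per monitor acquisition. */
  synchronized void add(Item item) {
    queues.computeIfAbsent(item.queueId, k -> new ArrayDeque<>()).addLast(item.url);
    totalSize++;
  }

  /** The feeder waits for the monitor once per batch instead of once per item. */
  synchronized void addAll(List<Item> batch) {
    for (Item item : batch) {
      add(item); // reentrant call, the monitor is already held
    }
  }

  /** Returns the next URL of the given queue, or null if there is none. */
  synchronized String poll(String queueId) {
    Deque<String> queue = queues.get(queueId);
    String url = (queue == null) ? null : queue.pollFirst();
    if (url != null) {
      totalSize--;
    }
    return url;
  }

  synchronized int size() {
    return totalSize;
  }

  public static void main(String[] args) {
    BatchQueuesSketch queues = new BatchQueuesSketch();
    queues.addAll(List.of(
        new Item("a.example", "https://a.example/"),
        new Item("b.example", "https://b.example/")));
    System.out.println(queues.size());
  }
}
```

A batch keeps the monitor for longer than a single add, so the batch size would need to be bounded to avoid starving the fetcher threads in turn.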
Flamegraph of a profiler run
([async-profiler|https://github.com/async-profiler/async-profiler]) of a
stalled/slow Fetcher map task (attached)