[
https://issues.apache.org/jira/browse/NUTCH-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891876#comment-17891876
]
ASF GitHub Bot commented on NUTCH-3067:
---------------------------------------
sebastian-nagel commented on PR #827:
URL: https://github.com/apache/nutch/pull/827#issuecomment-2429352484
(rebased on recent master)
> Improve performance of FetchItemQueues if error state is preserved
> ------------------------------------------------------------------
>
> Key: NUTCH-3067
> URL: https://issues.apache.org/jira/browse/NUTCH-3067
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.21
>
> Attachments:
> Screenshot_20240905_101623_fetcher_tasks_many_queues.png,
> fetcher.map.20240711113925.925750.flamegraph.html
>
>
> In certain cases the error state of a fetch queue needs to be
> preserved, even if the queue is (currently) empty, because there might
> still be URLs in the fetcher input not yet read by the QueueFeeder,
> see NUTCH-2947. Keeping the queue together with its state is necessary
> - to skip queues together with all items queued now or to be queued
> later by the QueueFeeder, if a queue exceeds the maximum configured
> number of exceptions (NUTCH-769). This is mostly a performance feature,
> but with implications for politeness because HTTP 403 Forbidden
> (and similar) responses are also counted as "exceptions".
> - to implement an exponential backoff which slows down the fetching from sites
> responding with repeated "exceptions" (NUTCH-2946).
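To illustrate the backoff idea mentioned above, here is a minimal sketch in which the delay doubles with each consecutive exception and is capped at a maximum. It is not taken from the Nutch code base: the class `BackoffSketch`, the method `backoffDelay` and the `maxDelay` cap are hypothetical names and assumptions made for this example only.

```java
import java.time.Duration;

// Hypothetical illustration, not the Nutch implementation.
class BackoffSketch {

    /**
     * Returns the delay to apply after the given number of consecutive
     * exceptions: baseDelay * 2^(exceptionCount - 1), capped at maxDelay.
     */
    static Duration backoffDelay(Duration baseDelay, int exceptionCount, Duration maxDelay) {
        if (exceptionCount <= 0 || baseDelay.isZero() || baseDelay.isNegative()) {
            return Duration.ZERO; // no exceptions seen or backoff disabled
        }
        // Cap the exponent so the intermediate value stays in a sane range.
        int exponent = Math.min(exceptionCount - 1, 30);
        double millis = baseDelay.toMillis() * Math.pow(2, exponent);
        if (millis >= maxDelay.toMillis()) {
            return maxDelay;
        }
        return Duration.ofMillis((long) millis);
    }
}
```

With a base delay of one second, the third consecutive exception would delay the next fetch from that queue by four seconds in this sketch.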
> However, there is a drawback when all "stateful" queues are preserved
> until the QueueFeeder has finished reading input fetch lists: Nutch's
> fetch queue implementation becomes slow if there are too many queues.
> This situation / issue was observed in the first cycle of a crawl
> where only the homepages of millions of sites were fetched:
> - about 1 million homepages per fetcher task
> - about 25% of the homepage URLs caused exceptions - the fetch lists were not
> filtered beforehand by whether a site is reachable and responding
> - consequently, after a certain amount of time (3-4 hours) 250k queues per
> task were "stateful" and preserved until the fetch list input was entirely
> read by the QueueFeeder
> - with too many queues and most of them empty (no URLs) the operations on the
> queues become slow and fetching almost stalls (see screenshot)
> - many queues but few URLs queued (250k vs. 25)
> - most fetcher threads (190 out of 240) waiting for the lock on one of the
> synchronized methods of FetchItemQueues
> - the QueueFeeder is also affected by the lock, which explains why only a few
> URLs are queued
> Important notes: this is not an issue
> - if no error state is preserved, that is if
> {{fetcher.max.exceptions.per.queue == -1}} and
> {{fetcher.exceptions.per.queue.delay == 0.0}}
> - or if the crawl isn't too "broad" in terms of the number of different hosts
> (domains or IPs, depending on {{fetcher.queue.mode}})
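The first condition above can be expressed as a small predicate. This is only a sketch: the class `ErrorStateConfigSketch` and the method `preservesErrorState` are hypothetical, a plain `Map` stands in for the job configuration, and the fallback values used when a property is missing are assumptions made for the example, not verified Nutch defaults.

```java
import java.util.Map;

// Hypothetical illustration; a Map stands in for the job configuration.
class ErrorStateConfigSketch {

    /**
     * Returns true if, per the description above, the error state of
     * queues would be preserved, i.e. unless
     * fetcher.max.exceptions.per.queue == -1 and
     * fetcher.exceptions.per.queue.delay == 0.0.
     */
    static boolean preservesErrorState(Map<String, String> conf) {
        // Fallback values are assumptions for this sketch only.
        int maxExceptions = Integer.parseInt(
            conf.getOrDefault("fetcher.max.exceptions.per.queue", "-1"));
        double delay = Double.parseDouble(
            conf.getOrDefault("fetcher.exceptions.per.queue.delay", "0.0"));
        return !(maxExceptions == -1 && delay == 0.0);
    }
}
```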
> As possible solutions:
> 1. do not keep every stateful queue: drop queues which have a low exception
> count after a configurable amount of time. If a second URL from the same
> host/domain/IP is fetched after a considerably long time span (e.g. 30
> minutes), the effect on performance and politeness should be negligible.
> 2. review the implementation of FetchItemQueues and the locking (synchronized
> methods)
> 3. at least, try to prioritize QueueFeeder, for example by a method which
> adds multiple fetch items within one synchronized call
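Options 1 and 3 could look roughly like the following sketch. It is a simplified stand-in and not the actual `FetchItemQueues` class: `QueueMapSketch`, `HostQueue`, `addAll`, `recordException` and `purgeIdle` are hypothetical names, and URLs are plain strings. The point is only to show a batch add that takes the lock once, and a purge that drops empty queues with a low exception count once they have been idle long enough.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a map of per-host fetch queues.
class QueueMapSketch {

    static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        int exceptionCount;
        long lastExceptionMillis;
    }

    private final Map<String, HostQueue> queues = new HashMap<>();

    /** Option 3: add a whole batch of URLs within one synchronized call. */
    synchronized int addAll(Map<String, List<String>> urlsByQueueId) {
        int added = 0;
        for (Map.Entry<String, List<String>> e : urlsByQueueId.entrySet()) {
            HostQueue q = queues.computeIfAbsent(e.getKey(), k -> new HostQueue());
            q.urls.addAll(e.getValue());
            added += e.getValue().size();
        }
        return added;
    }

    /** Records an exception, creating the (stateful) queue if needed. */
    synchronized void recordException(String queueId, long nowMillis) {
        HostQueue q = queues.computeIfAbsent(queueId, k -> new HostQueue());
        q.exceptionCount++;
        q.lastExceptionMillis = nowMillis;
    }

    /**
     * Option 1: drop empty queues whose exception count is at most
     * maxExceptionsToDrop and whose last exception is at least
     * idleMillis in the past. Returns the number of dropped queues.
     */
    synchronized int purgeIdle(long nowMillis, long idleMillis, int maxExceptionsToDrop) {
        int removed = 0;
        Iterator<HostQueue> it = queues.values().iterator();
        while (it.hasNext()) {
            HostQueue q = it.next();
            boolean idle = nowMillis - q.lastExceptionMillis >= idleMillis;
            if (q.urls.isEmpty() && q.exceptionCount <= maxExceptionsToDrop && idle) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    synchronized int queueCount() {
        return queues.size();
    }
}
```

In this sketch the purge still iterates over all queues while holding the lock, so it would have to run infrequently (for example from a single housekeeping thread) to avoid adding to the contention it is meant to relieve.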
> Details and data:
> Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)
> Counts of the top (deepest) line in the stack traces of all Fetcher threads:
> {noformat}
> 120 at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
>  49 at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
>  21 at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
>  19 at java.net.PlainSocketImpl.socketConnect([email protected]/Native Method)
>  18 at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
>   6 at java.lang.Object.wait([email protected]/Native Method) # waiting for HTTP/2 stream
>   4 at java.lang.Thread.sleep([email protected]/Native Method)
>   2 at java.net.Inet4AddressImpl.lookupAllHostAddr([email protected]/Native Method)
>   1 at java.util.Collections$SynchronizedCollection.size([email protected]/Collections.java:2017)
> {noformat}
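Counts like the ones above can be derived from a thread dump by taking the first `at ...` frame of each thread and tallying it. The following is a minimal sketch of that idea, not the tooling actually used for this issue; the class `TopFrameCounter` and the method `countTopFrames` are hypothetical, and the sketch assumes the usual thread dump layout in which each thread starts with a line beginning with its quoted name.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for tallying the top stack frame per thread.
class TopFrameCounter {

    /**
     * Counts the top (first) "at ..." frame of every thread whose
     * quoted name starts with threadNamePrefix.
     */
    static Map<String, Integer> countTopFrames(String threadDump, String threadNamePrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean expectTopFrame = false;
        for (String line : threadDump.split("\\R")) {
            String t = line.trim();
            if (t.startsWith("\"")) {
                // Header line of a new thread, e.g. "FetcherThread" #38 daemon ...
                expectTopFrame = t.startsWith("\"" + threadNamePrefix);
            } else if (expectTopFrame && t.startsWith("at ")) {
                counts.merge(t.substring(3), 1, Integer::sum);
                expectTopFrame = false; // ignore deeper frames of this thread
            }
        }
        return counts;
    }
}
```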
> Full stack traces (three examples):
> {noformat}
> "FetcherThread" #38 daemon prio=5 os_prio=0 cpu=43743.17ms elapsed=15890.29s tid=0x0000752967fff800 nid=0x83a3c waiting for monitor entry [0x000075292fcf9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
>     - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:301)
> "FetcherThread" #72 daemon prio=5 os_prio=0 cpu=38381.67ms elapsed=15881.02s tid=0x000075292822d000 nid=0x83a91 waiting for monitor entry [0x0000752926cfe000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
>     - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>     at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:338)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:489)
> "FetcherThread" #43 daemon prio=5 os_prio=0 cpu=39112.96ms elapsed=15889.09s tid=0x0000752928361000 nid=0x83a41 waiting for monitor entry [0x000075292d65f000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
>     - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:345)
> {noformat}
> Stack of the blocked QueueFeeder:
> {noformat}
> "QueueFeeder" #31 daemon prio=5 os_prio=0 cpu=19415.88ms elapsed=15926.65s tid=0x000075296780c800 nid=0x83a30 waiting for monitor entry [0x000075292fff9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:142)
>     - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>     at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:136)
>     at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:141)
> {noformat}
> Flamegraph of a profiler run
> ([async-profiler|https://github.com/async-profiler/async-profiler]) of a
> "stalled"/slow Fetcher map task (attached)