[ https://issues.apache.org/jira/browse/NUTCH-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891816#comment-17891816 ]
ASF GitHub Bot commented on NUTCH-3067:
---------------------------------------

sebastian-nagel commented on PR #827:
URL: https://github.com/apache/nutch/pull/827#issuecomment-2429051412

   This PR has been successfully tested in production: using the default of 30 minutes for `fetcher.exceptions.per.queue.clear.after`, the number of FetchQueues held stabilizes after half an hour, because queues with a single error and no new additions of URLs are removed after this time span.

> Improve performance of FetchItemQueues if error state is preserved
> ------------------------------------------------------------------
>
>                 Key: NUTCH-3067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3067
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.21
>
>         Attachments: Screenshot_20240905_101623_fetcher_tasks_many_queues.png, fetcher.map.20240711113925.925750.flamegraph.html
>
>
> In certain cases the error state of a fetch queue needs to be
> preserved, even if the queue is (currently) empty, because there might
> still be URLs in the fetcher input not yet read by the QueueFeeder,
> see NUTCH-2947. Keeping the queue together with its state is necessary
> - to skip queues together with all items queued now or to be queued
>   later by the QueueFeeder, if a queue exceeds the maximum configured
>   number of exceptions (NUTCH-769). This is mostly a performance feature,
>   but with implications for politeness because HTTP 403 Forbidden
>   (and similar) responses are also counted as "exceptions".
> - to implement an exponential backoff which slows down the fetching from sites
>   responding with repeated "exceptions" (NUTCH-2946).
> However, there is a drawback when all "stateful" queues are preserved
> until the QueueFeeder has finished reading input fetch lists: Nutch's
> fetch queue implementation becomes slow if there are too many queues.
>
> This situation / issue was observed in the first cycle of a crawl
> where only the homepages of millions of sites were fetched:
> - about 1 million homepages per fetcher task
> - about 25% of the homepage URLs caused exceptions - the fetch lists were not
>   filtered beforehand for whether a site is reachable and responding
> - consequently, after a certain amount of time (3-4 hours) 250k queues per
>   task were "stateful" and preserved until the fetch list input was entirely
>   read by the QueueFeeder
> - with too many queues and most of them empty (no URLs) the operations on the
>   queues become slow and fetching nearly stalls (see screenshot)
> - many queues but few URLs queued (250k vs. 25)
> - most fetcher threads (190 out of 240) waiting for the lock on one of the
>   synchronized methods of FetchItemQueues
> - also the QueueFeeder is affected by the lock, which explains why only few
>   URLs are queued
>
> Important notes: this is not an issue
> - if no error state is preserved, that is if
>   {{fetcher.max.exceptions.per.queue == -1}} and
>   {{fetcher.exceptions.per.queue.delay == 0.0}}
> - or if the crawl isn't too "broad" in terms of the number of different hosts
>   (domains or IPs, depending on {{fetcher.queue.mode}})
>
> Possible solutions:
> 1. do not keep every stateful queue: drop queues which have a low exception
>    count after a configurable amount of time. If a second URL from the same
>    host/domain/IP is fetched after a considerably long time span (e.g. 30
>    minutes), the effect on performance and politeness should be negligible.
> 2. review the implementation of FetchItemQueues and the locking (synchronized
>    methods)
> 3. at least, try to prioritize QueueFeeder, for example by a method which
>    adds multiple fetch items within one synchronized call
>
> Details and data:
>
> Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)
>
> Counts of the top (deepest) line in the stack traces of all Fetcher threads:
> {noformat}
>     120 at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
>      49 at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
>      21 at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
>      19 at java.net.PlainSocketImpl.socketConnect(java.base@11.0.24/Native Method)
>      18 at java.net.SocketInputStream.socketRead0(java.base@11.0.24/Native Method)
>       6 at java.lang.Object.wait(java.base@11.0.24/Native Method)  # waiting for HTTP/2 stream
>       4 at java.lang.Thread.sleep(java.base@11.0.24/Native Method)
>       2 at java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@11.0.24/Native Method)
>       1 at java.util.Collections$SynchronizedCollection.size(java.base@11.0.24/Collections.java:2017)
> {noformat}
>
> Full stack traces (three examples):
> {noformat}
> "FetcherThread" #38 daemon prio=5 os_prio=0 cpu=43743.17ms elapsed=15890.29s tid=0x0000752967fff800 nid=0x83a3c waiting for monitor entry [0x000075292fcf9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
>         - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:301)
>
> "FetcherThread" #72 daemon prio=5 os_prio=0 cpu=38381.67ms elapsed=15881.02s tid=0x000075292822d000 nid=0x83a91 waiting for monitor entry [0x0000752926cfe000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
>         - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>         at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:338)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:489)
>
> "FetcherThread" #43 daemon prio=5 os_prio=0 cpu=39112.96ms elapsed=15889.09s tid=0x0000752928361000 nid=0x83a41 waiting for monitor entry [0x000075292d65f000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
>         - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:345)
> {noformat}
>
> Stack of the blocked QueueFeeder:
> {noformat}
> "QueueFeeder" #31 daemon prio=5 os_prio=0 cpu=19415.88ms elapsed=15926.65s tid=0x000075296780c800 nid=0x83a30 waiting for monitor entry [0x000075292fff9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:142)
>         - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
>         at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:136)
>         at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:141)
> {noformat}
>
> Flamegraph of a profiler run
> ([async-profiler|https://github.com/async-profiler/async-profiler]) of a
> "stale"/slow Fetcher map task (attached)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
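
The first possible solution in the issue description (drop empty queues with a low exception count once a configurable time span has passed) can be illustrated with a small stand-alone sketch. This is not the code from PR #827 and not Nutch's FetchItemQueues: the class `StatefulQueueRegistry`, its fields, and the two constructor parameters are hypothetical names chosen for illustration. The sketch shows only the eviction rule and is deliberately not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of "stateful" fetch queues; NOT Nutch's FetchItemQueues. */
class StatefulQueueRegistry {

  /** Error state that is kept for a queue even while it holds no URLs. */
  static final class QueueState {
    int exceptionCount;       // exceptions recorded for this queue so far
    long lastExceptionMillis; // time of the most recent exception
    int queuedUrls;           // URLs currently waiting in this queue
  }

  private final Map<String, QueueState> queues = new HashMap<>();
  private final long clearAfterMillis;   // assumed: time span after which state may be dropped
  private final int maxExceptionsToDrop; // assumed: "low exception count" limit

  StatefulQueueRegistry(long clearAfterMillis, int maxExceptionsToDrop) {
    this.clearAfterMillis = clearAfterMillis;
    this.maxExceptionsToDrop = maxExceptionsToDrop;
  }

  /** Records one exception for the queue, creating its state if needed. */
  void recordException(String queueId, long nowMillis) {
    QueueState s = queues.computeIfAbsent(queueId, id -> new QueueState());
    s.exceptionCount++;
    s.lastExceptionMillis = nowMillis;
  }

  /** Adds one URL to the queue, creating the queue if needed. */
  void addUrl(String queueId) {
    queues.computeIfAbsent(queueId, id -> new QueueState()).queuedUrls++;
  }

  /**
   * Drops queues that are empty, have few exceptions, and whose last
   * exception is at least clearAfterMillis old. Returns the number dropped.
   */
  int purgeStale(long nowMillis) {
    int before = queues.size();
    queues.values().removeIf(s ->
        s.queuedUrls == 0
            && s.exceptionCount <= maxExceptionsToDrop
            && nowMillis - s.lastExceptionMillis >= clearAfterMillis);
    return before - queues.size();
  }

  int size() {
    return queues.size();
  }

  public static void main(String[] args) {
    // 30 minutes, drop only queues with at most one exception
    StatefulQueueRegistry registry = new StatefulQueueRegistry(30 * 60 * 1000L, 1);
    registry.recordException("a.example", 0L);
    int dropped = registry.purgeStale(31 * 60 * 1000L);
    System.out.println("dropped " + dropped + ", kept " + registry.size());
  }
}
```

With such a rule, the number of retained queues is bounded by the queues that saw an error within the last time span rather than by all queues that ever failed, which matches the stabilization after half an hour reported in the comment above. Queues with pending URLs or with a higher exception count keep their state, so skipping and backoff still apply to them.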