[jira] [Commented] (NUTCH-3067) Improve performance of FetchItemQueues if error state is preserved

ASF GitHub Bot (Jira) Tue, 22 Oct 2024 04:43:23 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891816#comment-17891816
 ]


ASF GitHub Bot commented on NUTCH-3067:
---------------------------------------

sebastian-nagel commented on PR #827:
URL: https://github.com/apache/nutch/pull/827#issuecomment-2429051412

   This PR has been successfully tested in production: using the default of 30 
minutes for `fetcher.exceptions.per.queue.clear.after`, the number of 
FetchQueues held stabilizes after half an hour, because queues with a single 
error and no newly added URLs are removed after this time span.
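
A minimal sketch of the clearing behaviour described above, not the code of the PR: the class `StatefulQueuePurgeSketch`, its methods, and the `lowExceptionLimit` parameter are hypothetical names chosen for illustration, and timestamps are passed in explicitly so the logic can be followed without a clock. It assumes a queue is dropped once it is empty, has at most `lowExceptionLimit` errors, and has seen no activity for the configured time span.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical illustration only; this is NOT Nutch's FetchItemQueues. */
class StatefulQueuePurgeSketch {

  /** Minimal per-queue state: queued URLs, error count, time of last activity. */
  static final class QueueState {
    int queuedUrls;
    int exceptions;
    long lastActivityMillis;
  }

  private final Map<String, QueueState> queues = new HashMap<>();
  private final long clearAfterMillis;  // assumed analogue of the 30-minute default
  private final int lowExceptionLimit;  // assumed threshold for a "low" error count

  StatefulQueuePurgeSketch(long clearAfterMillis, int lowExceptionLimit) {
    this.clearAfterMillis = clearAfterMillis;
    this.lowExceptionLimit = lowExceptionLimit;
  }

  /** Queues one URL, creating the queue if needed, and refreshes its activity time. */
  synchronized void addUrl(String queueId, long nowMillis) {
    QueueState s = queues.computeIfAbsent(queueId, k -> new QueueState());
    s.queuedUrls++;
    s.lastActivityMillis = nowMillis;
  }

  /** Marks one URL of the queue as fetched, optionally recording an error. */
  synchronized void finishUrl(String queueId, boolean failed, long nowMillis) {
    QueueState s = queues.get(queueId);
    if (s == null) {
      return;
    }
    if (s.queuedUrls > 0) {
      s.queuedUrls--;
    }
    if (failed) {
      s.exceptions++;
    }
    s.lastActivityMillis = nowMillis;
    // An empty queue without error state holds nothing worth keeping.
    if (s.queuedUrls == 0 && s.exceptions == 0) {
      queues.remove(queueId);
    }
  }

  /** Drops empty, idle queues with a low error count; returns how many were removed. */
  synchronized int purgeExpired(long nowMillis) {
    int removed = 0;
    Iterator<QueueState> it = queues.values().iterator();
    while (it.hasNext()) {
      QueueState s = it.next();
      boolean idle = nowMillis - s.lastActivityMillis >= clearAfterMillis;
      if (s.queuedUrls == 0 && s.exceptions <= lowExceptionLimit && idle) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  synchronized int size() {
    return queues.size();
  }
}
```

In this sketch, queues with repeated errors are kept regardless of idle time, so they can still be skipped or slowed down while the rest of the fetch list is read.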




> Improve performance of FetchItemQueues if error state is preserved
> ------------------------------------------------------------------
>
>                 Key: NUTCH-3067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3067
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.21
>
>         Attachments: 
> Screenshot_20240905_101623_fetcher_tasks_many_queues.png, 
> fetcher.map.20240711113925.925750.flamegraph.html
>
>
> In certain cases the error state of a fetch queue needs to be
> preserved, even if the queue is (currently) empty, because there might
> still be URLs in the fetcher input not yet read by the QueueFeeder,
> see NUTCH-2947. Keeping the queue together with its state is necessary
> - to skip queues together with all items queued now or to be queued
>   later by the QueueFeeder, if a queue exceeds the maximum configured
>   number of exceptions (NUTCH-769). This is mostly a performance feature,
>   but with implications for politeness, because HTTP 403 Forbidden
>   (and similar responses) are also counted as "exceptions".
> - to implement an exponential backoff which slows down the fetching from sites
>   responding with repeated "exceptions" (NUTCH-2946).
> However, there is a drawback when all "stateful" queues are preserved
> until the QueueFeeder has finished reading input fetch lists: Nutch's
> fetch queue implementation becomes slow if there are too many queues.
> This situation / issue was observed in the first cycle of a crawl
> where only the homepages of millions of sites were fetched:
> - about 1 million homepages per fetcher task
> - about 25% of the homepage URLs caused exceptions - the fetch lists were not 
> filtered beforehand by whether a site is reachable and responding
> - consequently, after a certain amount of time (3-4 hours) 250k queues per 
> task were "stateful" and preserved until the fetch list input was entirely 
> read by the QueueFeeder
> - with too many queues and most of them empty (no URLs) the operations on the 
> queues become slow and fetching almost stalls (see screenshot)
>   - many queues but few URLs queued (250k vs. 25)
>   - most fetcher threads (190 out of 240) waiting for the lock on one of the 
> synchronized methods of FetchItemQueues
>   - also the QueueFeeder is affected by the lock which explains why only few 
> URLs are queued
> Important notes: this is not an issue
> - if no error state is preserved, that is if 
> {{fetcher.max.exceptions.per.queue == -1}} and 
> {{fetcher.exceptions.per.queue.delay == 0.0}}
> - or if the crawl isn't too "broad" in terms of the number of different hosts 
> (domains or IPs, depending on {{fetcher.queue.mode}})
> As possible solutions:
> 1. do not keep every stateful queue: drop queues which have a low exception 
> count after a configurable amount of time. If a second URL from the same 
> host/domain/IP is fetched after a considerably long time span (e.g. 30 
> minutes), the effect on performance and politeness should be negligible.
> 2. review the implementation of FetchItemQueues and the locking (synchronized 
> methods)
> 3. at least, try to prioritize QueueFeeder, for example by a method which 
> adds multiple fetch items within one synchronized call
> Details and data:
> Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)
> Counts of the top (deepest) line in the stack traces of all Fetcher threads:
> {noformat}
> 120             at 
> org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
> 49              at 
> org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
> 21              at 
> org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
> 19              at 
> java.net.PlainSocketImpl.socketConnect(java.base@11.0.24/Native Method)
> 18              at 
> java.net.SocketInputStream.socketRead0(java.base@11.0.24/Native Method)
> 6               at java.lang.Object.wait(java.base@11.0.24/Native Method)  # 
> waiting for HTTP/2 stream
> 4               at java.lang.Thread.sleep(java.base@11.0.24/Native Method)
> 2               at 
> java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@11.0.24/Native Method)
> 1               at 
> java.util.Collections$SynchronizedCollection.size(java.base@11.0.24/Collections.java:2017)
> {noformat}
> Full stack traces (three examples):
> {noformat}
> "FetcherThread" #38 daemon prio=5 os_prio=0 cpu=43743.17ms elapsed=15890.29s 
> tid=0x0000752967fff800 nid=0x83a3c waiting for monitor entry  
> [0x000075292fcf9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
>         - waiting to lock <0x000000066894b9d8> (a 
> org.apache.nutch.fetcher.FetchItemQueues)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:301)
> "FetcherThread" #72 daemon prio=5 os_prio=0 cpu=38381.67ms elapsed=15881.02s 
> tid=0x000075292822d000 nid=0x83a91 waiting for monitor entry  
> [0x0000752926cfe000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
>         - waiting to lock <0x000000066894b9d8> (a 
> org.apache.nutch.fetcher.FetchItemQueues)
>         at 
> org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:338)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:489)
> "FetcherThread" #43 daemon prio=5 os_prio=0 cpu=39112.96ms elapsed=15889.09s 
> tid=0x0000752928361000 nid=0x83a41 waiting for monitor entry  
> [0x000075292d65f000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
>         - waiting to lock <0x000000066894b9d8> (a 
> org.apache.nutch.fetcher.FetchItemQueues)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:345)
> {noformat}
> Stack of the blocked QueueFeeder:
> {noformat}
> "QueueFeeder" #31 daemon prio=5 os_prio=0 cpu=19415.88ms elapsed=15926.65s 
> tid=0x000075296780c800 nid=0x83a30 waiting for monitor entry  
> [0x000075292fff9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:142)
>         - waiting to lock <0x000000066894b9d8> (a 
> org.apache.nutch.fetcher.FetchItemQueues)
>         at 
> org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:136)
>         at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:141)
> {noformat}
> Flamegraph of a profiler run 
> ([async-profiler|https://github.com/async-profiler/async-profiler]) of a 
> "stale"/slow Fetcher map task (attached)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
