[jira] [Created] (NUTCH-3067) Improve performance of FetchItemQueues if error state is preserved

Sebastian Nagel (Jira) Sat, 07 Sep 2024 06:44:14 -0700

Sebastian Nagel created NUTCH-3067:
--------------------------------------

             Summary: Improve performance of FetchItemQueues if error state is preserved
                 Key: NUTCH-3067
                 URL: https://issues.apache.org/jira/browse/NUTCH-3067
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.20
            Reporter: Sebastian Nagel
            Assignee: Sebastian Nagel
             Fix For: 1.21
         Attachments: Screenshot_20240905_101623_fetcher_tasks_many_queues.png,
                      fetcher.map.20240711113925.925750.flamegraph.html


In certain cases the error state of a fetch queue needs to be
preserved, even if the queue is (currently) empty, because there might
still be URLs in the fetcher input not yet read by the QueueFeeder,
see NUTCH-2947. Keeping the queue together with its state is necessary

- to skip queues together with all items queued now or to be queued
  later by the QueueFeeder, if a queue exceeds the maximum configured
  number of exceptions (NUTCH-769). This is mostly a performance feature,
  but with implications for politeness because HTTP 403 Forbidden
  (and similar responses) are also counted as "exceptions".

- to implement an exponential backoff which slows down the fetching from sites
  responding with repeated "exceptions" (NUTCH-2946).
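
To illustrate the second point, here is a minimal sketch of how an exponential backoff delay could be derived from a queue's exception count. The class name, the method signature, the doubling rule and the cap are all assumptions made for illustration; they are not taken from the Nutch code or from NUTCH-2946.

```java
/** Hypothetical helper; the formula below is an assumption, not Nutch's implementation. */
class BackoffSketch {

  /**
   * Extra delay before the next fetch from a queue that has seen the given
   * number of exceptions: the initial delay is doubled for every exception
   * after the first and capped at maxDelayMillis.
   */
  static long delayMillis(int exceptions, long initialDelayMillis, long maxDelayMillis) {
    if (exceptions <= 0 || initialDelayMillis <= 0) {
      return 0L; // no error state recorded or backoff disabled: no extra delay
    }
    // Bound the shift so that realistic delays cannot overflow a long.
    int shift = Math.min(exceptions - 1, 30);
    long delay = initialDelayMillis << shift;
    return Math.min(delay, maxDelayMillis);
  }

  public static void main(String[] args) {
    // Print the delay for 0 to 5 exceptions, with 1 s initial delay and a 60 s cap.
    for (int n = 0; n <= 5; n++) {
      System.out.println(n + " exceptions -> " + delayMillis(n, 1000L, 60_000L) + " ms");
    }
  }
}
```

The sketch also shows why the state has to outlive an empty queue: the delay depends on the exception count, which is lost if the queue is dropped and later recreated.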

However, there is a drawback when all "stateful" queues are preserved
until the QueueFeeder has finished reading input fetch lists: Nutch's
fetch queue implementation becomes slow if there are too many queues.
This situation / issue was observed in the first cycle of a crawl
where only the homepages of millions of sites were fetched:
- about 1 million homepages per fetcher task
- about 25% of the homepage URLs caused exceptions - the fetch lists were not
filtered beforehand by whether a site is reachable and responding
- consequently, after a certain amount of time (3-4 hours) 250k queues per task 
were "stateful" and preserved until the fetch list input was entirely read by 
the QueueFeeder
- with too many queues and most of them empty (no URLs) the operations on the
queues become slow and fetching almost stalls (see screenshot)
  - many queues but few URLs queued (250k vs. 25)
  - most fetcher threads (190 out of 240) waiting for the lock on one of the 
synchronized methods of FetchItemQueues
  - also the QueueFeeder is affected by the lock which explains why only a few
URLs are queued

Important notes: this is not an issue
- if no error state is preserved, that is if
{{fetcher.max.exceptions.per.queue == -1}} and {{fetcher.exceptions.per.queue.delay == 0.0}}
- or if the crawl isn't too "broad" in terms of the number of different hosts 
(domains or IPs, depending on {{fetcher.queue.mode}})

Possible solutions:

1. do not keep every stateful queue: drop queues which have a low exception 
count after a configurable amount of time. If a second URL from the same 
host/domain/IP is fetched after a considerable time span (e.g. 30 minutes), 
the effect on performance and politeness should be negligible.

2. review the implementation of FetchItemQueues and the locking (synchronized 
methods)

3. at least, try to prioritize QueueFeeder, for example by a method which adds 
multiple fetch items within one synchronized call
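
Options 1 and 3 can be sketched as follows. This is a simplified, hypothetical stand-in for FetchItemQueues, not the actual Nutch class: the names QueueSketch, Item, addAll, recordException and dropIdleQueues, as well as the thresholds used in main, are assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for FetchItemQueues (not Nutch code). */
class QueueSketch {

  /** Simplified fetch item: a queue ID (host, domain or IP) and a URL. */
  static final class Item {
    final String queueId;
    final String url;
    Item(String queueId, String url) { this.queueId = queueId; this.url = url; }
  }

  /** One queue per host/domain/IP, with its error state kept alongside. */
  static final class Queue {
    final Deque<String> urls = new ArrayDeque<>();
    int exceptions = 0;         // number of failed fetches seen so far
    long lastErrorMillis = 0L;  // time of the most recent failure
  }

  private final Map<String, Queue> queues = new HashMap<>();

  /** Option 3: add a whole batch while acquiring the monitor only once. */
  public synchronized void addAll(List<Item> batch) {
    for (Item item : batch) {
      queues.computeIfAbsent(item.queueId, id -> new Queue()).urls.add(item.url);
    }
  }

  /** Records a failed fetch, creating the (stateful) queue if necessary. */
  public synchronized void recordException(String queueId, long nowMillis) {
    Queue q = queues.computeIfAbsent(queueId, id -> new Queue());
    q.exceptions++;
    q.lastErrorMillis = nowMillis;
  }

  /**
   * Option 1: drop empty queues with a low exception count whose last error
   * is at least maxIdleMillis old. Returns the number of queues dropped.
   */
  public synchronized int dropIdleQueues(long nowMillis, long maxIdleMillis, int maxExceptions) {
    int dropped = 0;
    Iterator<Queue> it = queues.values().iterator();
    while (it.hasNext()) {
      Queue q = it.next();
      if (q.urls.isEmpty() && q.exceptions <= maxExceptions
          && nowMillis - q.lastErrorMillis >= maxIdleMillis) {
        it.remove();
        dropped++;
      }
    }
    return dropped;
  }

  public synchronized int queueCount() { return queues.size(); }

  public static void main(String[] args) {
    QueueSketch qs = new QueueSketch();
    qs.addAll(List.of(new Item("a.example", "https://a.example/"),
                      new Item("b.example", "https://b.example/")));
    qs.recordException("c.example", 0L);
    long thirtyMinutes = 30L * 60 * 1000;
    int dropped = qs.dropIdleQueues(thirtyMinutes, thirtyMinutes, 1);
    System.out.println("dropped=" + dropped + " remaining=" + qs.queueCount());
  }
}
```

Note that dropIdleQueues scans all queues while holding the lock, so in this sketch it would have to run infrequently (for example from a periodic maintenance step) to avoid adding to the contention it is meant to relieve.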


Details and data:

Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)

Counts of the top (deepest) line in the stack traces of all Fetcher threads:
{noformat}
120             at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
49              at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
21              at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
19              at java.net.PlainSocketImpl.socketConnect(java.base@11.0.24/Native Method)
18              at java.net.SocketInputStream.socketRead0(java.base@11.0.24/Native Method)
6               at java.lang.Object.wait(java.base@11.0.24/Native Method)  # waiting for HTTP/2 stream
4               at java.lang.Thread.sleep(java.base@11.0.24/Native Method)
2               at java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@11.0.24/Native Method)
1               at java.util.Collections$SynchronizedCollection.size(java.base@11.0.24/Collections.java:2017)
{noformat}

Full stack traces (three examples):
{noformat}
"FetcherThread" #38 daemon prio=5 os_prio=0 cpu=43743.17ms elapsed=15890.29s tid=0x0000752967fff800 nid=0x83a3c waiting for monitor entry  [0x000075292fcf9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:301)

"FetcherThread" #72 daemon prio=5 os_prio=0 cpu=38381.67ms elapsed=15881.02s tid=0x000075292822d000 nid=0x83a91 waiting for monitor entry  [0x0000752926cfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:338)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:489)

"FetcherThread" #43 daemon prio=5 os_prio=0 cpu=39112.96ms elapsed=15889.09s tid=0x0000752928361000 nid=0x83a41 waiting for monitor entry  [0x000075292d65f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:345)
{noformat}

Stack of the blocked QueueFeeder:
{noformat}
"QueueFeeder" #31 daemon prio=5 os_prio=0 cpu=19415.88ms elapsed=15926.65s tid=0x000075296780c800 nid=0x83a30 waiting for monitor entry  [0x000075292fff9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:142)
        - waiting to lock <0x000000066894b9d8> (a org.apache.nutch.fetcher.FetchItemQueues)
        at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:136)
        at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:141)
{noformat}

Flamegraph of a profiler run
([async-profiler|https://github.com/async-profiler/async-profiler]) of a
stalled/slow Fetcher map task (attached)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
