sebastian-nagel opened a new pull request, #832:
URL: https://github.com/apache/nutch/pull/832

   - Fetcher signals QueueFeeder (via FetchItemQueues) to stop queuing and 
waits 1.5 seconds for this to happen
   - QueueFeeder (if still alive) discards remaining items with QueuingStatus 
"HIT_BY_TIMEOUT"
   - add number of items skipped by QueueFeeder to "FetcherStatus" counters
   - mark variables defined in FetchItemQueues but accessed from QueueFeeder 
thread as "volatile"
   - minor code cleanups
   
   
   The solution was tested the following way:
   - the local Apache httpd was configured to respond with "HTTP 429 Too many 
requests" on a specific path
   - a seed list containing URLs with mostly URLs on this path on localhost was 
created and converted into a segment using the "freegen" tool
   - Fetcher was run with a single fetcher thread, a short queue and a high 
timeout divisor (= short timeout):
     ```
     nutch fetch -Dfetcher.queue.depth.multiplier=2 
-Dfetcher.threads.timeout.divisor=20 path/to/segment -threads 1
     ```
   
   Because most of the fetches failed with HTTP status code 429, the queue was 
throttled using the configured exponential backoff. When the delay was larger 
then the timeout, Fetcher stopped the QueueFeeder and finished. Below the log 
snippets from the hadoop.log.
   
   ```
   2024-10-24 14:20:10,021 INFO o.a.n.f.Fetcher [main] Fetcher: starting
   ...
   2024-10-24 14:20:10,934 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Fetcher: threads: 1
   2024-10-24 14:20:10,954 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Fetcher: time-out divisor: 20
   ...
   2024-10-24 14:20:11,008 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task 
Executor #0] FetcherThread 51 Using queue mode : byHost
   ...
   2024-10-24 14:20:11,009 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 57 fetching http://localhost/nutch/test-exception/19.html (queue 
crawl delay=3000ms)
   ...
   2024-10-24 14:20:11,590 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] 
http://localhost/robots.txt - http/1.1 301 Moved Permanently
   ...
   2024-10-24 14:20:11,594 DEBUG o.a.n.p.h.a.HttpRobotRulesParser 
[FetcherThread] Following robots.txt redirect: 
http://localhost/nutch/robots.txt -> http://localhost/nutch/rd1-robots.txt
   2024-10-24 14:20:11,595 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] 
http://localhost/nutch/rd1-robots.txt - http/1.1 302 Found
   ...
   2024-10-24 14:20:11,596 DEBUG o.a.n.p.h.a.HttpRobotRulesParser 
[FetcherThread] Following robots.txt redirect: 
http://localhost/nutch/rd1-robots.txt -> http://localhost/nutch/rd2-robots.txt
   2024-10-24 14:20:11,597 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] 
http://localhost/nutch/rd2-robots.txt - http/1.1 302 Found
   ...
   2024-10-24 14:20:11,597 DEBUG o.a.n.p.h.a.HttpRobotRulesParser 
[FetcherThread] Following robots.txt redirect: 
http://localhost/nutch/rd2-robots.txt -> 
http://localhost/nutch/target-robots.txt
   2024-10-24 14:20:11,601 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] 
http://localhost/nutch/target-robots.txt - http/1.1 200 OK
   ...
   2024-10-24 14:20:11,602 DEBUG o.a.n.p.h.a.HttpRobotRulesParser 
[FetcherThread] Fetched robots.txt for 
http://localhost/nutch/test-exception/19.html with status code 200
   2024-10-24 14:20:11,606 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] 
http://localhost/nutch/test-exception/19.html - http/1.1 429 Too Many Requests
   ...
   2024-10-24 14:20:11,708 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 57 fetch of http://localhost/nutch/test-exception/19.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/19.html
   2024-10-24 14:20:11,708 INFO o.a.n.f.FetchItemQueues [FetcherThread] * 
queue: localhost >> delayed next fetch by 3000 ms after 1 exceptions in queue
   ...
   2024-10-24 14:20:17,711 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 57 fetching http://localhost/nutch/test-exception/ (queue crawl 
delay=3000ms)
   2024-10-24 14:20:17,714 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] 
http://localhost/nutch/test-exception/ - http/1.1 429 Too Many Requests
   ...
   2024-10-24 14:20:17,718 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 57 fetch of http://localhost/nutch/test-exception/ failed with: 
Http code=429, url=http://localhost/nutch/test-exception/
   2024-10-24 14:20:17,718 INFO o.a.n.f.FetchItemQueues [FetcherThread] * 
queue: localhost >> delayed next fetch by 6000 ms after 2 exceptions in queue
   ...
   2024-10-24 14:21:11,787 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] 
http://localhost/nutch/test-exception/3.html - http/1.1 429 Too Many Requests
   2024-10-24 14:21:11,797 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 57 fetch of http://localhost/nutch/test-exception/3.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/3.html
   2024-10-24 14:21:11,798 INFO o.a.n.f.FetchItemQueues [FetcherThread] * 
queue: localhost >> delayed next fetch by 48000 ms after 5 exceptions in queue
   ...
   2024-10-24 14:21:42,095 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=2, 
fetchQueues.getQueueCount=1
   2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Timeout reached with no new requests since 30000 seconds.
   2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Aborting with 1 hung threads (queue feeder still alive).
   2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Thread #0 hung while processing 
http://localhost/nutch/test-exception/3.html
   2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Stack of thread #0:
   java.base@11.0.24/java.lang.Thread.sleep(Native Method)
   app//org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:278)
   
   2024-10-24 14:21:42,096 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Signaled QueueFeeder to stop, waiting 1.5 seconds before exiting.
   2024-10-24 14:21:42,981 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder 
stopping, timeout reached.
   2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder 
finished: total 8 records
   2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder 
queuing status:
   2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder]       8       
SUCCESSFULLY_QUEUED
   2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder]       0       
ERROR_CREATE_FETCH_ITEM
   2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder]       0       
ABOVE_EXCEPTION_THRESHOLD
   2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder]       0       
HIT_BY_TIMELIMIT
   2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder]       16      
HIT_BY_TIMEOUT
   2024-10-24 14:21:43,597 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task 
Executor #0] Aborting with 2 queued fetch items in 1 queues.
   2024-10-24 14:21:43,597 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map 
Task Executor #0] * queue: localhost >> dropping!
   2024-10-24 14:21:43,598 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map 
Task Executor #0] Emptied all queues: 1 queues with 2 items
   2024-10-24 14:21:43,639 WARN o.a.h.m.i.MetricsSystemImpl [pool-6-thread-1] 
JobTracker metrics system already initialized!
   2024-10-24 14:21:43,813 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 57 has no more work available
   2024-10-24 14:21:43,814 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 57 -finishing thread FetcherThread, activeThreads=0
   2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main]  map 100% reduce 100%
   2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main] Job job_local288913864_0001 
completed successfully
   2024-10-24 14:21:43,857 INFO o.a.h.m.Job [main] Counters: 35
       ...
        FetcherStatus
                bytes_downloaded=120
                exception=5
                hitByTimeout=16
                hungThreads=1
                success=1
       ...
   2024-10-24 14:21:43,858 INFO o.a.n.f.Fetcher [main] Fetcher: finished, 
elapsed: 93836 ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to