sebastian-nagel opened a new pull request, #832: URL: https://github.com/apache/nutch/pull/832
- Fetcher signals QueueFeeder (via FetchItemQueues) to stop queuing and waits 1.5 seconds for this to happen - QueueFeeder (if still alive) discards remaining items with QueuingStatus "HIT_BY_TIMEOUT" - add number of items skipped by QueueFeeder to "FetcherStatus" counters - mark variables defined in FetchItemQueues but accessed from QueueFeeder thread as "volatile" - minor code cleanups The solution was tested the following way: - the local Apache httpd was configured to respond with "HTTP 429 Too many requests" on a specific path - a seed list containing URLs with mostly URLs on this path on localhost was created and converted into a segment using the "freegen" tool - Fetcher was run with a single fetcher thread, a short queue and a high timeout divisor (= short timeout): ``` nutch fetch -Dfetcher.queue.depth.multiplier=2 -Dfetcher.threads.timeout.divisor=20 path/to/segment -threads 1 ``` Because most of the fetches failed with HTTP status code 429, the queue was throttled using the configured exponential backoff. When the delay was larger then the timeout, Fetcher stopped the QueueFeeder and finished. Below the log snippets from the hadoop.log. ``` 2024-10-24 14:20:10,021 INFO o.a.n.f.Fetcher [main] Fetcher: starting ... 2024-10-24 14:20:10,934 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: threads: 1 2024-10-24 14:20:10,954 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: time-out divisor: 20 ... 2024-10-24 14:20:11,008 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 51 Using queue mode : byHost ... 2024-10-24 14:20:11,009 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetching http://localhost/nutch/test-exception/19.html (queue crawl delay=3000ms) ... 2024-10-24 14:20:11,590 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/robots.txt - http/1.1 301 Moved Permanently ... 2024-10-24 14:20:11,594 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/robots.txt -> http://localhost/nutch/rd1-robots.txt 2024-10-24 14:20:11,595 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/rd1-robots.txt - http/1.1 302 Found ... 2024-10-24 14:20:11,596 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/rd1-robots.txt -> http://localhost/nutch/rd2-robots.txt 2024-10-24 14:20:11,597 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/rd2-robots.txt - http/1.1 302 Found ... 2024-10-24 14:20:11,597 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/rd2-robots.txt -> http://localhost/nutch/target-robots.txt 2024-10-24 14:20:11,601 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/target-robots.txt - http/1.1 200 OK ... 2024-10-24 14:20:11,602 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for http://localhost/nutch/test-exception/19.html with status code 200 2024-10-24 14:20:11,606 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/19.html - http/1.1 429 Too Many Requests ... 2024-10-24 14:20:11,708 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/19.html failed with: Http code=429, url=http://localhost/nutch/test-exception/19.html 2024-10-24 14:20:11,708 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 3000 ms after 1 exceptions in queue ... 2024-10-24 14:20:17,711 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetching http://localhost/nutch/test-exception/ (queue crawl delay=3000ms) 2024-10-24 14:20:17,714 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/ - http/1.1 429 Too Many Requests ... 2024-10-24 14:20:17,718 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/ failed with: Http code=429, url=http://localhost/nutch/test-exception/ 2024-10-24 14:20:17,718 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 6000 ms after 2 exceptions in queue ... 2024-10-24 14:21:11,787 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/3.html - http/1.1 429 Too Many Requests 2024-10-24 14:21:11,797 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/3.html failed with: Http code=429, url=http://localhost/nutch/test-exception/3.html 2024-10-24 14:21:11,798 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 48000 ms after 5 exceptions in queue ... 2024-10-24 14:21:42,095 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1 2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Timeout reached with no new requests since 30000 seconds. 2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 1 hung threads (queue feeder still alive). 2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Thread #0 hung while processing http://localhost/nutch/test-exception/3.html 2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Stack of thread #0: java.base@11.0.24/java.lang.Thread.sleep(Native Method) app//org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:278) 2024-10-24 14:21:42,096 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Signaled QueueFeeder to stop, waiting 1.5 seconds before exiting. 2024-10-24 14:21:42,981 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder stopping, timeout reached. 2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder finished: total 8 records 2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder queuing status: 2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] 8 SUCCESSFULLY_QUEUED 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0 ERROR_CREATE_FETCH_ITEM 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0 ABOVE_EXCEPTION_THRESHOLD 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0 HIT_BY_TIMELIMIT 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 16 HIT_BY_TIMEOUT 2024-10-24 14:21:43,597 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 2 queued fetch items in 1 queues. 2024-10-24 14:21:43,597 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] * queue: localhost >> dropping! 2024-10-24 14:21:43,598 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] Emptied all queues: 1 queues with 2 items 2024-10-24 14:21:43,639 WARN o.a.h.m.i.MetricsSystemImpl [pool-6-thread-1] JobTracker metrics system already initialized! 2024-10-24 14:21:43,813 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 has no more work available 2024-10-24 14:21:43,814 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 -finishing thread FetcherThread, activeThreads=0 2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main] map 100% reduce 100% 2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main] Job job_local288913864_0001 completed successfully 2024-10-24 14:21:43,857 INFO o.a.h.m.Job [main] Counters: 35 ... FetcherStatus bytes_downloaded=120 exception=5 hitByTimeout=16 hungThreads=1 success=1 ... 2024-10-24 14:21:43,858 INFO o.a.n.f.Fetcher [main] Fetcher: finished, elapsed: 93836 ms ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org