[ https://issues.apache.org/jira/browse/NUTCH-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892467#comment-17892467 ]
ASF GitHub Bot commented on NUTCH-3072: --------------------------------------- sebastian-nagel opened a new pull request, #832: URL: https://github.com/apache/nutch/pull/832 - Fetcher signals QueueFeeder (via FetchItemQueues) to stop queuing and waits 1.5 seconds for this to happen - QueueFeeder (if still alive) discards remaining items with QueuingStatus "HIT_BY_TIMEOUT" - add number of items skipped by QueueFeeder to "FetcherStatus" counters - mark variables defined in FetchItemQueues but accessed from QueueFeeder thread as "volatile" - minor code cleanups The solution was tested the following way: - the local Apache httpd was configured to respond with "HTTP 429 Too many requests" on a specific path - a seed list containing URLs with mostly URLs on this path on localhost was created and converted into a segment using the "freegen" tool - Fetcher was run with a single fetcher thread, a short queue and a high timeout divisor (= short timeout): ``` nutch fetch -Dfetcher.queue.depth.multiplier=2 -Dfetcher.threads.timeout.divisor=20 path/to/segment -threads 1 ``` Because most of the fetches failed with HTTP status code 429, the queue was throttled using the configured exponential backoff. When the delay was larger then the timeout, Fetcher stopped the QueueFeeder and finished. Below the log snippets from the hadoop.log. ``` 2024-10-24 14:20:10,021 INFO o.a.n.f.Fetcher [main] Fetcher: starting ... 2024-10-24 14:20:10,934 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: threads: 1 2024-10-24 14:20:10,954 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Fetcher: time-out divisor: 20 ... 2024-10-24 14:20:11,008 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task Executor #0] FetcherThread 51 Using queue mode : byHost ... 2024-10-24 14:20:11,009 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetching http://localhost/nutch/test-exception/19.html (queue crawl delay=3000ms) ... 2024-10-24 14:20:11,590 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/robots.txt - http/1.1 301 Moved Permanently ... 2024-10-24 14:20:11,594 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/robots.txt -> http://localhost/nutch/rd1-robots.txt 2024-10-24 14:20:11,595 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/rd1-robots.txt - http/1.1 302 Found ... 2024-10-24 14:20:11,596 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/rd1-robots.txt -> http://localhost/nutch/rd2-robots.txt 2024-10-24 14:20:11,597 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/rd2-robots.txt - http/1.1 302 Found ... 2024-10-24 14:20:11,597 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Following robots.txt redirect: http://localhost/nutch/rd2-robots.txt -> http://localhost/nutch/target-robots.txt 2024-10-24 14:20:11,601 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/target-robots.txt - http/1.1 200 OK ... 2024-10-24 14:20:11,602 DEBUG o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread] Fetched robots.txt for http://localhost/nutch/test-exception/19.html with status code 200 2024-10-24 14:20:11,606 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/19.html - http/1.1 429 Too Many Requests ... 2024-10-24 14:20:11,708 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/19.html failed with: Http code=429, url=http://localhost/nutch/test-exception/19.html 2024-10-24 14:20:11,708 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 3000 ms after 1 exceptions in queue ... 2024-10-24 14:20:17,711 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetching http://localhost/nutch/test-exception/ (queue crawl delay=3000ms) 2024-10-24 14:20:17,714 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/ - http/1.1 429 Too Many Requests ... 2024-10-24 14:20:17,718 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/ failed with: Http code=429, url=http://localhost/nutch/test-exception/ 2024-10-24 14:20:17,718 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 6000 ms after 2 exceptions in queue ... 2024-10-24 14:21:11,787 DEBUG o.a.n.p.o.OkHttpResponse [FetcherThread] http://localhost/nutch/test-exception/3.html - http/1.1 429 Too Many Requests 2024-10-24 14:21:11,797 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 fetch of http://localhost/nutch/test-exception/3.html failed with: Http code=429, url=http://localhost/nutch/test-exception/3.html 2024-10-24 14:21:11,798 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: localhost >> delayed next fetch by 48000 ms after 5 exceptions in queue ... 2024-10-24 14:21:42,095 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] -activeThreads=1, spinWaiting=1, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1 2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Timeout reached with no new requests since 30000 seconds. 2024-10-24 14:21:42,095 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 1 hung threads (queue feeder still alive). 2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Thread #0 hung while processing http://localhost/nutch/test-exception/3.html 2024-10-24 14:21:42,096 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Stack of thread #0: java.base@11.0.24/java.lang.Thread.sleep(Native Method) app//org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:278) 2024-10-24 14:21:42,096 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Signaled QueueFeeder to stop, waiting 1.5 seconds before exiting. 2024-10-24 14:21:42,981 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder stopping, timeout reached. 2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder finished: total 8 records 2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder queuing status: 2024-10-24 14:21:42,982 INFO o.a.n.f.QueueFeeder [QueueFeeder] 8 SUCCESSFULLY_QUEUED 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0 ERROR_CREATE_FETCH_ITEM 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0 ABOVE_EXCEPTION_THRESHOLD 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0 HIT_BY_TIMELIMIT 2024-10-24 14:21:42,983 INFO o.a.n.f.QueueFeeder [QueueFeeder] 16 HIT_BY_TIMEOUT 2024-10-24 14:21:43,597 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor #0] Aborting with 2 queued fetch items in 1 queues. 2024-10-24 14:21:43,597 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] * queue: localhost >> dropping! 2024-10-24 14:21:43,598 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task Executor #0] Emptied all queues: 1 queues with 2 items 2024-10-24 14:21:43,639 WARN o.a.h.m.i.MetricsSystemImpl [pool-6-thread-1] JobTracker metrics system already initialized! 2024-10-24 14:21:43,813 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 has no more work available 2024-10-24 14:21:43,814 INFO o.a.n.f.FetcherThread [FetcherThread] FetcherThread 57 -finishing thread FetcherThread, activeThreads=0 2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main] map 100% reduce 100% 2024-10-24 14:21:43,851 INFO o.a.h.m.Job [main] Job job_local288913864_0001 completed successfully 2024-10-24 14:21:43,857 INFO o.a.h.m.Job [main] Counters: 35 ... FetcherStatus bytes_downloaded=120 exception=5 hitByTimeout=16 hungThreads=1 success=1 ... 2024-10-24 14:21:43,858 INFO o.a.n.f.Fetcher [main] Fetcher: finished, elapsed: 93836 ms ``` > Fetcher to stop QueueFeeder if aborting with "hung threads" > ----------------------------------------------------------- > > Key: NUTCH-3072 > URL: https://issues.apache.org/jira/browse/NUTCH-3072 > Project: Nutch > Issue Type: Bug > Components: fetcher > Affects Versions: 1.20 > Reporter: Sebastian Nagel > Assignee: Sebastian Nagel > Priority: Major > Fix For: 1.21 > > > Fetcher is shutting down if there is no progress (not a single URL fetched) > during half of the MapReduce task timeout, see fetcher.thread.timeout.divisor > and NUTCH-1057. Before the shut-down Fetcher reports the active > FetcherThreads as "hung threads" and drops existing FetchQueues. After that > the task continues with sorting and merging the spilled data. FetcherThreads > and also the QueueFeeder might be still running in this moment, which opens > potential concurrency issues when a FetcherThread writes output data while > the output is already in the process of sorting. > Fetcher should stop the QueueFeeder and/or make sure it isn't alive anymore. > In addition, a short wait (one second) should help FetcherThreads to shut > down. > The issue was observed while testing a solution for NUTCH-3067. > {noformat} > 2024-10-02 09:33:18,796 INFO [main] fetcher.Fetcher: -activeThreads=120, > spinWaiting=119, fetchQueues.totalSize=12000, fetchQueues.getQueueCount=9884 > 2024-10-02 09:33:18,797 WARN [main] fetcher.Fetcher: Aborting with 120 hung > threads. > ... > 2024-10-02 09:33:18,828 WARN [main] fetcher.Fetcher: Aborting with 12000 > queued fetch items in 9884 queues (queue feeder still alive). > 2024-10-02 09:33:18,828 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > ... (reporting dropped queues) > 2024-10-02 09:33:18,903 INFO [main] fetcher.FetchItemQueues: Emptied all > queues: 9279 queues with 12000 items > 2024-10-02 09:33:18,906 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > 2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: Starting flush of map > output > 2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: Spilling map output > 2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: bufstart = 124101175; > bufend = 177094062; bufvoid = 314572800 > 2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: kvstart = > 31025288(124101152); kvend = 30983880(123935520); length = 41409/19660800 > 2024-10-02 09:33:18,907 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > ... > 2024-10-02 09:33:19,292 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > 2024-10-02 09:33:19,294 INFO [main] mapred.Merger: Merging 4 sorted segments > 2024-10-02 09:33:19,295 INFO [main] mapred.Merger: Down to the last > merge-pass, with 4 segments left of total size: 979973 bytes > 2024-10-02 09:33:19,296 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > ... > 2024-10-02 09:33:19,478 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > 2024-10-02 09:33:19,480 DEBUG [QueueFeeder] fetcher.QueueFeeder: -feeding > 12000 input urls ... > 2024-10-02 09:33:19,480 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > 2024-10-02 09:33:19,481 DEBUG [FetcherThread] fetcher.FetcherThread: > FetcherThread spin-waiting ... > 2024-10-02 09:33:19,481 ERROR [QueueFeeder] fetcher.QueueFeeder: QueueFeeder > error reading input, record 89118 > java.io.IOException: Stream closed > at > org.apache.hadoop.io.compress.DecompressorStream.checkStream(DecompressorStream.java:184) > ... > at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:120) > ... > 2024-10-02 09:33:19,507 INFO [FetcherThread] fetcher.FetcherThread: > FetcherThread 1285 has no more work available > 2024-10-02 09:33:19,507 INFO [FetcherThread] fetcher.FetcherThread: > FetcherThread 1285 -finishing thread FetcherThread, activeThreads=104 > 2024-10-02 09:33:19,507 INFO [main] mapred.Merger: Merging 4 sorted segments > 2024-10-02 09:33:19,508 INFO [FetcherThread] fetcher.FetcherThread: > FetcherThread 319 has no more work available > 2024-10-02 09:33:19,508 INFO [FetcherThread] fetcher.FetcherThread: > FetcherThread 319 -finishing thread FetcherThread, activeThreads=103 > 2024-10-02 09:33:19,509 INFO [main] mapred.Merger: Down to the last > merge-pass, with 4 segments left of total size: 959873 bytes > 2024-10-02 09:33:19,509 INFO [FetcherThread] fetcher.FetcherThread: > FetcherThread 1112 has no more work available > 2024-10-02 09:33:19,509 INFO [FetcherThread] fetcher.FetcherThread: > FetcherThread 1112 -finishing thread FetcherThread, activeThreads=102 > ... > {noformat} > Later on, one of the reducer tasks failed when reading data. This caused the > job to fail: > {noformat} > 2024-10-02 10:37:31,347 INFO mapreduce.Job: map 100% reduce 92% > 2024-10-02 10:37:31,347 INFO mapreduce.Job: Task Id : > attempt_1727715735602_0060_r_000007_1, Status : FAILED > Error: java.lang.ArrayIndexOutOfBoundsException: Index 114 out of bounds for > length 25 > at > org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:49) > ... > 2024-10-02 11:08:34,260 ERROR fetcher.Fetcher: Fetcher: > java.lang.RuntimeException: Fetcher job did not succeed, job id: > job_1727715735602_0060, job status: FAILED, reason: Task failed > task_1727715735602_0060_r_000007 > Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 > killedReduces: 0 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)