This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
     new 14fc33099 NUTCH-3114 Avoid stale fetching when only URLs from queues blocked by the exponential backoff remain
14fc33099 is described below

commit 14fc3309998ca8d115a5f3d504e1859911660dc5
Author: Sebastian Nagel <[email protected]>
AuthorDate: Wed Jul 9 17:14:07 2025 +0200

    NUTCH-3114 Avoid stale fetching when only URLs
               from queues blocked by the exponential backoff remain
---
 conf/nutch-default.xml | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 88d61b479..a8a953f75 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1150,12 +1150,20 @@
 
 <property>
   <name>fetcher.max.exceptions.per.queue</name>
-  <value>-1</value>
+  <value>5</value>
   <description>The maximum number of protocol-level exceptions
   (e.g. timeouts) or HTTP status codes mapped to ProtocolStatus.EXCEPTION
   per host (or IP) queue. Once this value is reached, any remaining entries
   from this queue are purged, effectively stopping the fetching from this
-  host/IP. The default value of -1 deactivates this limit.
+  host/IP. A value of -1 deactivates this limit.
+
+  Note that the exponential backoff mechanism (see the property
+  fetcher.exceptions.per.queue.delay) causes increasing wait times
+  after each exception in a queue. If there is no time limit
+  (fetcher.timelimit.mins) or minimum throughput
+  (fetcher.throughput.threshold.pages) configured, it is recommended
+  to set this property to a considerably low value. This avoids the
+  fetch process from hanging when only URLs in blocked queues remain.
   </description>
 </property>
 

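For deployments that want a limit other than the new default of 5, or that
want a time limit in addition to it, the properties named in the description
can be overridden in conf/nutch-site.xml. The fragment below is only a sketch:
the values 10 and 180 are illustrative placeholders, not recommendations made
by this commit.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>

  <!-- Purge a host/IP queue after this many protocol-level exceptions.
       10 is a hypothetical value; -1 would deactivate the limit. -->
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>10</value>
  </property>

  <!-- Optional additional safeguard: stop fetching after a fixed number
       of minutes. 180 is a hypothetical value. -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>180</value>
  </property>

</configuration>
```

With a time limit configured, a higher exception limit is less risky, since
the fetcher no longer depends on the per-queue limit alone to finish when only
blocked queues remain.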