This is an automated email from the ASF dual-hosted git repository.
snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git
The following commit(s) were added to refs/heads/master by this push:
new 14fc33099 NUTCH-3114 Avoid stale fetching when only URLs
from queues blocked by the exponential backoff remain
14fc33099 is described below
commit 14fc3309998ca8d115a5f3d504e1859911660dc5
Author: Sebastian Nagel <[email protected]>
AuthorDate: Wed Jul 9 17:14:07 2025 +0200
NUTCH-3114 Avoid stale fetching when only URLs
from queues blocked by the exponential backoff remain
---
conf/nutch-default.xml | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 88d61b479..a8a953f75 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1150,12 +1150,20 @@
<property>
<name>fetcher.max.exceptions.per.queue</name>
- <value>-1</value>
+ <value>5</value>
<description>The maximum number of protocol-level exceptions
(e.g. timeouts) or HTTP status codes mapped to ProtocolStatus.EXCEPTION
per host (or IP) queue. Once this value is reached, any remaining entries
from this queue are purged, effectively stopping the fetching from this
- host/IP. The default value of -1 deactivates this limit.
+ host/IP. A value of -1 deactivates this limit.
+
+ Note that the exponential backoff mechanism (see the property
+ fetcher.exceptions.per.queue.delay) causes increasing wait times
+ after each exception in a queue. If there is no time limit
+ (fetcher.timelimit.mins) or minimum throughput
+ (fetcher.throughput.threshold.pages) configured, it is recommended
+ to set this property to a considerably low value. This avoids the
+ fetch process from hanging when only URLs in blocked queues remain.
</description>
</property>
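
For illustration only, and not part of this commit: a minimal sketch of a site-level override, assuming the usual conf/nutch-site.xml mechanism for overriding values from nutch-default.xml. The property names are the ones mentioned in the description above. The value 180 for fetcher.timelimit.mins is a hypothetical example, not a recommendation from the commit.

```xml
<?xml version="1.0"?>
<!-- Hypothetical conf/nutch-site.xml fragment; values are illustrative. -->
<configuration>

  <!-- Purge a host/IP queue after this many protocol-level exceptions,
       so that queues held back by the exponential backoff cannot
       stall the fetcher indefinitely. 5 matches the new default. -->
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>5</value>
  </property>

  <!-- Optional additional safeguard: an overall time limit for the
       fetch, in minutes. 180 is an assumed example value. -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>180</value>
  </property>

</configuration>
```

With either a finite exception limit or a time limit in place, a fetch run should be able to terminate even when the only remaining URLs sit in queues that are currently blocked by the backoff.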