Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
-------------------------------------------------------------------------
Key: NUTCH-344
URL: http://issues.apache.org/jira/browse/NUTCH-344
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8.1, 0.9.0
Environment: All
Reporter: Greg Kim
Attachments: cleanExpiredServerBlocks.patch
The recent change to the following code in HttpBase.java tends to block
fetcher threads while one thread busy-waits:
  private static void cleanExpiredServerBlocks() {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <===== LINE 3
        String host = (String) BLOCKED_ADDR_QUEUE.getLast();
        long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
        if (time <= System.currentTimeMillis()) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          BLOCKED_ADDR_QUEUE.removeLast();
        }
      }
    }
  }
LINE 3: As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the
thread that first enters this block keeps looping, without sleeping, until the
queue is empty. If the last entry has not yet expired, nothing is removed and
the loop spins while holding the lock, so all other threads block on the
synchronized block. This leads to extremely poor fetcher performance.
Since the checkin to respect crawlDelay in robots.txt, we are no longer
guaranteed that BLOCKED_ADDR_QUEUE is a FIFO list: entries with a longer
crawl delay can expire later than entries queued after them. The simple fix is
to iterate over the queue once rather than busy-waiting.
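
The single-pass idea could look roughly like the sketch below. This is not the
attached patch; it is a minimal illustration under stated assumptions. The class
name ExpiredBlockCleaner, the helper blockAddr, and the explicit "now"
parameter are hypothetical, added only so the example is self-contained; the
two collections mirror the field names quoted above, and generics are used for
brevity.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

public class ExpiredBlockCleaner {
    // Stand-ins for the HttpBase fields named in the report (assumed shapes).
    static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

    // Hypothetical helper: marks host as blocked until expiryMillis.
    static void blockAddr(String host, long expiryMillis) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            BLOCKED_ADDR_TO_TIME.put(host, expiryMillis);
            BLOCKED_ADDR_QUEUE.addFirst(host);
        }
    }

    // Visits every queue entry exactly once and drops the expired ones.
    // The lock is held for one bounded pass instead of an open-ended spin,
    // and correctness no longer depends on the queue being ordered by expiry.
    static void cleanExpiredServerBlocks(long now) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
            while (it.hasNext()) {
                String host = it.next();
                Long time = BLOCKED_ADDR_TO_TIME.get(host);
                // A missing map entry is treated as stale and removed too.
                if (time == null || time <= now) {
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    it.remove();
                }
            }
        }
    }
}
```

For example, after blockAddr("slow.example", 5000L) followed by
blockAddr("fast.example", 1000L), the last queue entry is the unexpired
slow.example. A call to cleanExpiredServerBlocks(2000L) removes only
fast.example and returns, whereas the loop quoted above would spin on
slow.example until it expired. The trade-off is that each call costs time
linear in the queue length.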