[ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427096 ]

Jacob Brunson commented on NUTCH-344:
-------------------------------------

I'm having problems with the patch committed in revision #429779. I used to have the "fetch aborted with X hung threads" problem. After updating to this revision, fetching goes fine for a while, but then I get this error on just about every page fetch attempt:

2006-08-09 23:27:28,548 INFO fetcher.Fetcher - fetching http://www.xmission.com/~nelsonb/resources.htm
2006-08-09 23:27:28,549 ERROR http.Http - java.lang.NullPointerException
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.cleanExpiredServerBlocks(HttpBase.java:382)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:323)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:188)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144)
2006-08-09 23:27:28,549 INFO fetcher.Fetcher - fetch of http://www.xmission.com/~nelsonb/resources.htm failed with: java.lang.NullPointerException

> Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-344
>                 URL: http://issues.apache.org/jira/browse/NUTCH-344
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1, 0.9.0, 0.8
>         Environment: All
>            Reporter: Greg Kim
>             Fix For: 0.8.1, 0.9.0
>
>         Attachments: cleanExpiredServerBlocks.patch, HttpBase.patch
>
>
> The recent change to the following code in HttpBase.java tends to block
> fetcher threads while one thread busy-waits...
>
>   private static void cleanExpiredServerBlocks() {
>     synchronized (BLOCKED_ADDR_TO_TIME) {
>       while (!BLOCKED_ADDR_QUEUE.isEmpty()) {    <===== LINE 3
>         String host = (String) BLOCKED_ADDR_QUEUE.getLast();
>         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
>         if (time <= System.currentTimeMillis()) {
>           BLOCKED_ADDR_TO_TIME.remove(host);
>           BLOCKED_ADDR_QUEUE.removeLast();
>         }
>       }
>     }
>   }
>
> LINE 3: As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the
> thread that first enters this block busy-waits until it becomes empty while
> all other threads block on the synchronized block. This leads to extremely
> poor fetcher performance.
>
> Since the checkin to respect crawlDelay in robots.txt, we are no longer
> guaranteed that BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix is
> to iterate the queue once rather than busy waiting...
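
For illustration only, here is a minimal sketch of what "iterate the queue once" could look like. It is not the code from either attached patch or from revision #429779. The class name ServerBlockCleaner, the helpers block() and blockedCount(), the generic types, and the explicit "now" parameter are all hypothetical additions made so the example runs on its own. The sketch also treats a queue entry with no matching map entry as expired; an unguarded longValue() on a missing entry is one possible, unconfirmed, source of the NullPointerException in the log above.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical stand-in for the relevant part of HttpBase; not the Nutch source.
class ServerBlockCleaner {
    // Assumed shape of the two static structures named in the issue.
    private static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    private static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

    /** Hypothetical helper: mark a host as blocked until expiresAtMillis. */
    static void block(String host, long expiresAtMillis) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            BLOCKED_ADDR_TO_TIME.put(host, expiresAtMillis);
            BLOCKED_ADDR_QUEUE.addFirst(host);
        }
    }

    /**
     * Walks the queue exactly once and drops every entry whose block has
     * expired. Entries that are still blocked are skipped, so the thread
     * holding the lock never spins waiting for the queue to drain.
     * Returns the number of queue entries removed.
     */
    static int cleanExpiredServerBlocks(long now) {
        int removed = 0;
        synchronized (BLOCKED_ADDR_TO_TIME) {
            Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
            while (it.hasNext()) {
                String host = it.next();
                Long time = BLOCKED_ADDR_TO_TIME.get(host);
                // A missing map entry is treated as expired instead of
                // being dereferenced (assumption, see lead-in).
                if (time == null || time <= now) {
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    it.remove(); // safe removal during iteration
                    removed++;
                }
            }
        }
        return removed;
    }

    /** Hypothetical helper: number of hosts currently queued as blocked. */
    static int blockedCount() {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            return BLOCKED_ADDR_QUEUE.size();
        }
    }
}
```

Because the pass does not rely on the queue being ordered by expiry time, it still behaves sensibly when per-host crawl delays make entries expire out of insertion order. A caller would pass System.currentTimeMillis() as "now"; it is a parameter here only to make the behavior easy to check.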
