Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
-------------------------------------------------------------------------

                 Key: NUTCH-344
                 URL: http://issues.apache.org/jira/browse/NUTCH-344
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8.1, 0.9.0
         Environment: All
            Reporter: Greg Kim
         Attachments: cleanExpiredServerBlocks.patch

A recent change to the following code in HttpBase.java tends to block fetcher 
threads while one thread busy-waits: 

  private static void cleanExpiredServerBlocks() {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   <===== LINE 3:   
        String host = (String) BLOCKED_ADDR_QUEUE.getLast();
        long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
        if (time <= System.currentTimeMillis()) {   
          BLOCKED_ADDR_TO_TIME.remove(host);
          BLOCKED_ADDR_QUEUE.removeLast();
        }
      }
    }
  }

LINE 3:  The loop only exits once BLOCKED_ADDR_QUEUE is empty, and it has no 
other exit when the last entry has not yet expired.  So as long as there are 
*any* entries in the queue, the thread that first enters this block busy-waits 
until the queue becomes empty, while all other threads block on the 
synchronized block.  This leads to extremely poor fetcher performance.  

Since the checkin to respect crawlDelay in robots.txt, we are no longer 
guaranteed that BLOCKED_ADDR_QUEUE is a FIFO list, so the entry at the end of 
the queue is not necessarily the first one to expire. The simple fix is to 
iterate the queue once rather than busy-waiting...
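
To make the proposed fix concrete, here is a minimal sketch of a single-pass 
cleanup. It is not the attached patch. The class name ExpiredBlockCleaner, the 
block() helper, and the "now" parameter are assumptions made for illustration; 
only the two collection names come from the code above.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical stand-in for the relevant part of HttpBase.
class ExpiredBlockCleaner {
  static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();
  static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();

  // Hypothetical helper: block a host until the given time (in millis).
  static void block(String host, long untilMillis) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      BLOCKED_ADDR_QUEUE.addFirst(host);
      BLOCKED_ADDR_TO_TIME.put(host, untilMillis);
    }
  }

  // Visits every queued host exactly once and drops the expired ones.
  // The lock is held for one pass only, so other threads are never
  // stuck behind a spinning loop. "now" is passed in (an assumption)
  // so that the behaviour can be checked deterministically.
  static void cleanExpiredServerBlocks(long now) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
      while (it.hasNext()) {
        String host = it.next();
        Long time = BLOCKED_ADDR_TO_TIME.get(host);
        // A missing map entry is treated as expired.
        if (time == null || time <= now) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          it.remove();
        }
      }
    }
  }

  public static void main(String[] args) {
    block("a.example", 100L);
    block("b.example", 500L);
    block("c.example", 200L);
    cleanExpiredServerBlocks(250L);
    System.out.println(BLOCKED_ADDR_QUEUE); // prints [b.example]
  }
}
```

Because the pass does not assume any ordering, it still removes expired 
entries that sit behind an unexpired one, at the cost of walking the whole 
queue on each call.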


        
