[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]
Jason Calabrese updated NUTCH-344:
----------------------------------
Attachment: HttpBase.patch
The earlier fix missed one small change, which caused BLOCKED_ADDR_TO_TIME and
BLOCKED_ADDR_QUEUE to get out of sync.
To fix the problem, you only need to change the remove call on line 385 to:
BLOCKED_ADDR_QUEUE.remove(i);
I can report that the fetch is now much faster with both of these fixes applied.
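For anyone reading without the patches at hand, here is a minimal sketch of what a single-pass cleanup with index-based removal might look like. It is not the attached patch: the class name, the block() helper, and the explicit "now" parameter are hypothetical and exist only to make the example self-contained, and the two fields are stand-ins for the HttpBase fields of the same name (written with generics for brevity).

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

public class ExpiredBlockCleaner {
    // Stand-ins for the HttpBase fields named in the issue (assumed types).
    static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

    // Hypothetical helper: record a block in both structures at once.
    static void block(String host, long unblockTime) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            BLOCKED_ADDR_TO_TIME.put(host, unblockTime);
            BLOCKED_ADDR_QUEUE.addFirst(host);
        }
    }

    // Walk the queue exactly once, from the tail, removing expired entries.
    // "now" is passed in (an assumption) so the behaviour is easy to test.
    static void cleanExpiredServerBlocks(long now) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            for (int i = BLOCKED_ADDR_QUEUE.size() - 1; i >= 0; i--) {
                String host = BLOCKED_ADDR_QUEUE.get(i);
                long time = BLOCKED_ADDR_TO_TIME.get(host);
                if (time <= now) {
                    // Remove the same entry from both structures so the
                    // map and the queue cannot drift out of sync.
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    BLOCKED_ADDR_QUEUE.remove(i); // remove(int index)
                }
            }
        }
    }
}
```

Iterating backwards means that removing index i never shifts an entry that has not been visited yet. The loop always terminates after one pass, even when an unexpired entry sits at the tail of the queue. Note that get(i) on a LinkedList is linear, so for a large queue an Iterator with it.remove() would be a cheaper way to get the same effect.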
> Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
> -------------------------------------------------------------------------
>
> Key: NUTCH-344
> URL: http://issues.apache.org/jira/browse/NUTCH-344
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8.1, 0.9.0, 0.8
> Environment: All
> Reporter: Greg Kim
> Fix For: 0.8.1, 0.9.0
>
> Attachments: cleanExpiredServerBlocks.patch, HttpBase.patch
>
>
> The recent change to the following code in HttpBase.java tends to block
> fetcher threads while one thread busy-waits...
>   private static void cleanExpiredServerBlocks() {
>     synchronized (BLOCKED_ADDR_TO_TIME) {
>       while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   <===== LINE 3
>         String host = (String) BLOCKED_ADDR_QUEUE.getLast();
>         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
>         if (time <= System.currentTimeMillis()) {
>           BLOCKED_ADDR_TO_TIME.remove(host);
>           BLOCKED_ADDR_QUEUE.removeLast();
>         }
>       }
>     }
>   }
> LINE 3: As long as there are *any* entries in BLOCKED_ADDR_QUEUE, the
> thread that first enters this block busy-waits until the queue becomes empty,
> while all other threads block on the synchronized block. This leads to
> extremely poor fetcher performance.
> Since the check-in to respect crawlDelay in robots.txt, we are no longer
> guaranteed that the BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix
> is to iterate over the queue once rather than busy-waiting...