I have come up with a temporary hack for this. The problem is caused by robots.txt files that set the crawl delay to huge values. For example, I saw many set to 500 seconds and some as high as 72000 seconds.

Attached is a temporary patch. It will get things working but is definitely not the best solution. The patch simply caps the delay at 30 seconds whenever it is longer than 30 seconds. Essentially it ignores the setting in the robots.txt file, which is not a good long-term solution.

A better long-term solution would be a property that allows pages with crawl delays greater than x seconds to be ignored.
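To make the property idea concrete, here is a rough sketch of what I have in mind. It is not part of the attached patch and not existing Nutch code: the class CrawlDelayPolicy, the SKIP sentinel, and the maxCrawlDelayMs limit are all hypothetical names, and delays are assumed to be in milliseconds as in the patch.

```java
// Hypothetical helper, not part of Nutch; all names here are illustrative.
final class CrawlDelayPolicy {

    /** Sentinel meaning "do not fetch this page at all". */
    static final long SKIP = -1L;

    // Default delay used when robots.txt does not set a crawl delay.
    private final long serverDelayMs;

    // Hypothetical configurable limit (the "x" above), in milliseconds.
    // A negative value disables the check.
    private final long maxCrawlDelayMs;

    CrawlDelayPolicy(long serverDelayMs, long maxCrawlDelayMs) {
        this.serverDelayMs = serverDelayMs;
        this.maxCrawlDelayMs = maxCrawlDelayMs;
    }

    /**
     * Returns the delay to honour in milliseconds, or SKIP when the
     * robots.txt crawl delay is longer than the configured limit.
     */
    long resolve(long crawlDelayMs) {
        if (crawlDelayMs <= 0) {
            // No crawl delay in robots.txt: fall back to the server delay.
            return serverDelayMs;
        }
        if (maxCrawlDelayMs >= 0 && crawlDelayMs > maxCrawlDelayMs) {
            // Respect robots.txt by skipping the page instead of
            // silently shortening the delay.
            return SKIP;
        }
        return crawlDelayMs;
    }
}
```

The caller would check for SKIP before blocking on the host and drop the page if it is returned. Unlike the attached patch, this never fetches faster than robots.txt asks for.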

Dennis

Uroš Gruber wrote:
Hi,

When fetching a segment I noticed

"Aborting with 10 hung threads."

in the Hadoop log. From that point on I see a lot of

2006-08-09 10:17:24,072 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:25,148 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:26,314 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:27,523 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:28,581 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:29,615 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:30,629 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:31,178 INFO  mapred.JobClient -  map 100%  reduce 73%
2006-08-09 10:17:31,640 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:32,686 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:33,695 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:34,737 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:35,742 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:36,809 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:37,865 INFO  mapred.LocalJobRunner - reduce > reduce

messages.

Can somebody explain what is going on, and whether it is normal for this to take so much CPU time?

 PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
4953 uros        2 132    0  1189M   127M RUN     33:05 99.02% java

--
Uros
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java	(revision 428457)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java	(working copy)
@@ -185,6 +185,13 @@
       
       long crawlDelay = robots.getCrawlDelay(this, u);
       long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
+      if (delay > 30000) {
+        delay = 30000;
+        int numSecondsDelay = (int)(crawlDelay / 1000);
+        LOGGER.info("Someone is setting way too long of a delay value..." + 
+          numSecondsDelay + " seconds");
+      }
+      
       String host = blockAddr(u, delay);
       Response response;
       try {
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
