Re: [Nutch-general] Aborting with hung threads

Dennis Kubes Wed, 09 Aug 2006 07:27:51 -0700

I have come up with a temporary hack for this. The problem is caused by crawl delays in robots.txt files being set to huge values (for example, I saw many that were set to 500 seconds and some as high as 72000 seconds).

Attached is a temporary patch. It will get things working, but it is definitely not the best solution for this problem. The patch simply sets the crawl delay to 30 seconds whenever the crawl delay is longer than 30 seconds. Essentially it ignores the settings in the robots.txt file, which is not a good long-term solution. A better long-term solution would be a property that allows pages with crawl delays greater than x seconds to be ignored.


Dennis


Uroš Gruber wrote:

Hi,

When fetching a segment I noticed

"Aborting with 10 hung threads."

in the Hadoop log. From that point on I see a lot of

2006-08-09 10:17:24,072 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:25,148 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:26,314 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:27,523 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:28,581 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:29,615 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:30,629 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:31,178 INFO  mapred.JobClient -  map 100%  reduce 73%
2006-08-09 10:17:31,640 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:32,686 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:33,695 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:34,737 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:35,742 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:36,809 INFO  mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:37,865 INFO  mapred.LocalJobRunner - reduce > reduce

messages.

Can somebody explain what is going on, and whether it is normal for this to take so much CPU time?


 PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
4953 uros        2 132    0  1189M   127M RUN     33:05 99.02% java

--
Uros

Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java	(revision 428457)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java	(working copy)
@@ -185,6 +185,13 @@
       
       long crawlDelay = robots.getCrawlDelay(this, u);
       long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
+      if (delay > 30000) {
+        delay = 30000;
+        int numSecondsDelay = (int)(crawlDelay / 1000);
+        LOGGER.info("Someone is setting way too long of a delay value..." + 
+          numSecondsDelay + " seconds");
+      }
+      
       String host = blockAddr(u, delay);
       Response response;
       try {
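
The patch above clamps every long delay to 30 seconds. The longer-term idea from the reply, skipping pages whose crawl delay exceeds a configurable limit, could look roughly like the sketch below. This is only an illustration under assumptions: CrawlDelayPolicy, effectiveDelay and maxCrawlDelayMs are hypothetical names that do not come from the Nutch code base, and all values are taken to be in milliseconds, as in the patch.

```java
import java.util.OptionalLong;

/** Hypothetical helper (not part of Nutch): picks the delay to use for a host. */
final class CrawlDelayPolicy {
  // Hypothetical configurable limit in milliseconds; a negative value disables the check.
  private final long maxCrawlDelayMs;

  CrawlDelayPolicy(long maxCrawlDelayMs) {
    this.maxCrawlDelayMs = maxCrawlDelayMs;
  }

  /**
   * @param crawlDelayMs  crawl delay from robots.txt in ms, or a value <= 0 if none is set
   * @param serverDelayMs default per-server delay in ms
   * @return the delay to honor, or empty if the page should be skipped
   */
  OptionalLong effectiveDelay(long crawlDelayMs, long serverDelayMs) {
    // Skip the page instead of overriding what robots.txt asked for.
    if (maxCrawlDelayMs >= 0 && crawlDelayMs > maxCrawlDelayMs) {
      return OptionalLong.empty();
    }
    // Same fallback as in the patch context: use the robots.txt value only when it is set.
    return OptionalLong.of(crawlDelayMs > 0 ? crawlDelayMs : serverDelayMs);
  }

  public static void main(String[] args) {
    CrawlDelayPolicy policy = new CrawlDelayPolicy(30_000L);
    long[] crawlDelays = {0L, 10_000L, 500_000L};
    for (long crawlDelay : crawlDelays) {
      OptionalLong delay = policy.effectiveDelay(crawlDelay, 5_000L);
      System.out.println(crawlDelay + " ms -> "
          + (delay.isPresent() ? "wait " + delay.getAsLong() + " ms" : "skip page"));
    }
  }
}
```

Unlike the clamp, this never fetches faster than robots.txt requests; the trade-off is that hosts with very long delays are not fetched at all.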
