I have come up with a temporary hack for this. The problem is caused by
crawl delays in robots.txt files being set to huge values (for example, I
saw many set to 500 seconds and some as high as 72000 seconds).
Attached is a temporary patch. It will get things working, but it is
definitely not the best solution. The patch simply caps the delay at
30 seconds whenever it is longer than 30 seconds. Essentially it
overrides the setting in the robots.txt file, which is not a good
long-term solution. A better long-term solution would be a property
that allows pages with crawl delays greater than x seconds to be
skipped.
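
To make the property idea concrete, here is a rough, standalone sketch.
It is not part of the attached patch and not existing Nutch code. The
class name CrawlDelayPolicy, the property name "fetcher.max.crawl.delay"
and the SKIP sentinel are all made up for illustration, and it reads
from plain java.util.Properties rather than the Nutch configuration.

```java
import java.util.Properties;

public class CrawlDelayPolicy {
    // Hypothetical property name (value in seconds); a negative value disables the check.
    public static final String MAX_DELAY_KEY = "fetcher.max.crawl.delay";

    // Sentinel returned when the page should be skipped rather than waited for.
    public static final long SKIP = -1L;

    private final long maxDelayMs;

    public CrawlDelayPolicy(Properties conf) {
        // Assumed default of 30 seconds, mirroring the cap used in the patch.
        long seconds = Long.parseLong(conf.getProperty(MAX_DELAY_KEY, "30"));
        this.maxDelayMs = seconds < 0 ? -1L : seconds * 1000L;
    }

    // Returns the delay to use in milliseconds, or SKIP if the
    // robots.txt crawl delay exceeds the configured maximum.
    public long effectiveDelayMs(long crawlDelayMs, long serverDelayMs) {
        if (maxDelayMs >= 0 && crawlDelayMs > maxDelayMs) {
            return SKIP;
        }
        // Same fallback as the existing code: use the server delay
        // when robots.txt does not specify a crawl delay.
        return crawlDelayMs > 0 ? crawlDelayMs : serverDelayMs;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAX_DELAY_KEY, "30");
        CrawlDelayPolicy policy = new CrawlDelayPolicy(conf);

        System.out.println(policy.effectiveDelayMs(5000L, 1000L));   // prints 5000
        System.out.println(policy.effectiveDelayMs(0L, 1000L));      // prints 1000
        System.out.println(policy.effectiveDelayMs(500000L, 1000L)); // prints -1
    }
}
```

Unlike the cap in the patch, this never fetches faster than the site
asked for. A caller would have to check for SKIP and drop the page
instead of blocking a thread for minutes or hours.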
Dennis
Uroš Gruber wrote:
Hi,
When fetching a segment I noticed
"Aborting with 10 hung threads."
in the hadoop log. From that point on I see a lot of
2006-08-09 10:17:24,072 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:25,148 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:26,314 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:27,523 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:28,581 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:29,615 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:30,629 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:31,178 INFO mapred.JobClient - map 100% reduce 73%
2006-08-09 10:17:31,640 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:32,686 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:33,695 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:34,737 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:35,742 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:36,809 INFO mapred.LocalJobRunner - reduce > reduce
2006-08-09 10:17:37,865 INFO mapred.LocalJobRunner - reduce > reduce
messages.
Can somebody explain what is going on, and whether it is normal for
this to take so much CPU time?
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
4953 uros 2 132 0 1189M 127M RUN 33:05 99.02% java
--
Uros
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java (revision 428457)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java (working copy)
@@ -185,6 +185,13 @@
long crawlDelay = robots.getCrawlDelay(this, u);
long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
+ if (delay > 30000) {
+ delay = 30000;
+ int numSecondsDelay = (int)(crawlDelay / 1000);
+ LOGGER.info("Someone is setting way too long of a delay value..." +
+ numSecondsDelay + " seconds");
+ }
+
String host = blockAddr(u, delay);
Response response;
try {
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general