[Nutch-general] nutch-0.8 intranet crawls & logs

Paul M Lieberman Fri, 15 Sep 2006 13:54:22 -0700

I've just switched from nutch-0.7.2 to nutch-0.8.

I'm attempting to do an intranet crawl of a single site. The setup I've 
used in nutch-0.7.2 translates well to nutch-0.8 with two exceptions:


1. The crawl no longer stays within the website. Why?

The single text file in my url directory contains the root URL:
http://www.psychologymatters.org/
and conf/crawl-urlfilter.txt has one line for accepting hosts:
+^http://([a-z0-9]*\.)*psychologymatters.org/

and
-.
to skip all else. So, why does nutch-0.8 pursue links outside this domain?
Here's how I invoke the crawl:
nohup bin/nutch crawl url -dir /d01/nutch/psychologymatters9 -depth 9 >& logs/psychologymatters9.log &
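
One way to rule out the filter rules themselves is to check them in isolation against a few URLs. The following is a minimal sketch in Python, not Nutch's own code. It assumes the filter reads rules top to bottom, lets the first matching rule decide, and rejects any URL that matches no rule. The helper names parse_rules and accepts and the sample URLs other than the root URL are hypothetical.

```python
import re

# The two rules quoted in the post: '+' accepts, '-' rejects.
RULES_TEXT = r"""
+^http://([a-z0-9]*\.)*psychologymatters.org/
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """Apply the rules in order; the first pattern found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL matching no rule is rejected


rules = parse_rules(RULES_TEXT)

for url in [
    "http://www.psychologymatters.org/",
    "http://psychologymatters.org/index.html",
    "http://www.apa.org/",
    "http://www.example.com/?u=http://www.psychologymatters.org/",
]:
    print("accept" if accepts(url, rules) else "reject", url)
```

If the rules behave as intended in a check like this, the rule file is probably not at fault, and the next thing to verify is whether the nutch-0.8 crawl command is actually reading conf/crawl-urlfilter.txt.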

2. The other question relates to log files. As you see above, I want to 
redirect output to a log file specific to this crawl. nutch-0.7.2 did 
just that, but nutch-0.8 appends all log messages to logs/hadoop.log 
instead. How can I change this?

- Paul M Lieberman
American Psychological Association

