I've just switched from nutch-0.7.2 to nutch-0.8. I'm attempting to do an intranet crawl of a single site. The setup I've used in nutch-0.7.2 translates well to nutch-0.8 with two exceptions:
1. The crawl is no longer staying within the website. Why? The single text file in my url directory contains the root URL:

   http://www.psychologymatters.org/

   and conf/crawl-urlfilter.txt has one line for accepting hosts:

   +^http://([a-z0-9]*\.)*psychologymatters.org/

   and

   -.

   to skip all else. So, why does nutch-0.8 pursue links outside this domain? Here's how I invoke the crawl:

   nohup bin/nutch crawl url -dir /d01/nutch/psychologymatters9 -depth 9 >& logs/psychologymatters9.log &

2. The other question relates to log files. As you see above, I want to redirect output to a log file specific to this crawl. nutch-0.7.2 does just that, but with nutch-0.8 all log messages are appended to logs/hadoop.log. How can I change this?

- Paul M Lieberman
American Psychological Association

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
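
P.S. To make my expectation concrete, here is a minimal Python sketch of how I understand the two filter rules quoted above to be evaluated. It is not Nutch's code. It assumes the rules are tried in file order, that the first pattern found anywhere in the URL decides the outcome, and that a URL matching no rule is skipped. The `accepts` helper and the sample URLs are hypothetical names chosen for illustration.

```python
import re

# The two rules quoted in the post, in file order: (sign, pattern).
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*psychologymatters.org/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins, and the pattern is searched for
    (not required to match the whole URL).
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is skipped


if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the actual crawl.
    samples = [
        "http://www.psychologymatters.org/",
        "http://psychologymatters.org/index.html",
        "http://www.example.com/",
    ]
    for url in samples:
        print(url, "accepted" if accepts(url) else "skipped")
```

Under these assumptions only URLs whose host ends in psychologymatters.org are accepted, which is the behaviour I expected from the crawl. One side note: the dot before "org" in the accept rule is unescaped, so strictly it matches any character rather than only a literal dot.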
