On 9/15/06, Paul M Lieberman <[EMAIL PROTECTED]> wrote:
> I've just switched from nutch-0.7.2 to nutch-0.8.
>
> I'm attempting to do an intranet crawl of a single site. The setup I've
> used in nutch-0.7.2 translates well to nutch-0.8 with two exceptions:
>
> 1. the crawl is no longer staying within the website. Why?
> The single text file in my url directory contains the root URL:
> http://www.psychologymatters.org/
> and conf/crawl-urlfilter.txt has one line for accepting hosts:
> +^http://([a-z0-9]*\.)*psychologymatters.org/
> and
> -.
> to skip all else. So, why does nutch-0.8 pursue links outside this domain?
> Here's how I invoke the crawl:

This should work; I cannot reproduce this bug on 0.8. Is your
regex-urlfilter the same as the crawl-urlfilter? Just wondering.

You might also want to set the following property to true in
nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

> nohup bin/nutch crawl url -dir /d01/nutch/psychologymatters9 -depth 9 >&
> logs/psychologymatters9.log &
>
> 2. The other question relates to log files. As you see above, I want to
> redirect to a log file specific to this crawl. In nutch-0.7.2, it does
> just that, but with nutch-0.8, all log messages are appended to
> logs/hadoop.log. How can I change this?

You need to edit the file conf/log4j.properties. There are a bunch of
options you can tweak. Please refer to the log4j documentation for that:
http://logging.apache.org/log4j/docs/documentation.html

> - Paul M Lieberman
> American Psychological Association
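As a sanity check on the two filter rules quoted above, here is a minimal
sketch in Python of how a first-match-wins rule list behaves. It is not
Nutch code: the function name accepts() and the test URLs are made up for
illustration, Python's re module stands in for the Java regex engine, and
the sketch assumes that a URL matching no rule is rejected. It suggests the
rules themselves should keep a crawl on the site, which is consistent with
the bug not being reproducible from the filter file alone.

```python
import re

# The two rules from the crawl-urlfilter.txt described in this thread.
# '+' means accept, '-' means reject; rules are tried top to bottom.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*psychologymatters.org/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption for this sketch: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    # Hypothetical test URLs, chosen only to exercise the rules.
    for url in [
        "http://www.psychologymatters.org/",
        "http://psychologymatters.org/index.html",
        "http://www.apa.org/",
        "https://www.psychologymatters.org/",
    ]:
        print(url, "accepted" if accepts(url) else "rejected")
```

Two things this makes visible: the https:// form of the site is rejected,
because the accept rule is anchored to http://, and the unescaped dot in
"psychologymatters.org" matches any character, so writing it as
"psychologymatters\.org" would be slightly stricter.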
