If you want to use the "crawl" command, you have to set up "crawl-urlfilter.txt". If you want to take advantage of "db.ignore.external.links", you have to follow the "Step-by-Step or Whole-web Crawling" process described at http://wiki.apache.org/nutch/NutchTutorial
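In case it helps, a rough sketch of that step-by-step cycle is below. The directory names are only placeholders and the exact arguments can differ between Nutch releases, so treat this as an outline and check the tutorial page for your version:

  bin/nutch inject crawl/crawldb urls                # seed the crawldb from the urls/ folder
  bin/nutch generate crawl/crawldb crawl/segments    # create a fetch list (a new segment)
  s1=`ls -d crawl/segments/2* | tail -1`             # pick the newest segment
  bin/nutch fetch $s1                                # fetch the pages in that segment
  bin/nutch updatedb crawl/crawldb $s1               # fold discovered outlinks back into the crawldb
  # repeat generate/fetch/updatedb for more depth, then invertlinks and index

With "db.ignore.external.links" set to "true", the updatedb step should only add URLs on the hosts you originally injected.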
Justin Yao wrote:
> Another workaround is to set "db.ignore.external.links" to "true" in
> your nutch-site.xml:
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to
>   include only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> Tony Wang wrote:
>> That helps a lot! Thanks!
>>
>> 2009/3/2 yanky young <[email protected]>
>>
>>> Hi:
>>>
>>> I am not a Nutch expert, but I think your problem is easy to solve.
>>>
>>> 1. Make a list of seed URLs in a file under the urls folder.
>>> 2. Add all of the domains you want to crawl to crawl-urlfilter.txt,
>>>    like this:
>>>
>>> # accept hosts in MY.DOMAIN.NAME
>>> +^http://([a-z0-9]*\.)*aaa.edu/
>>> +^http://([a-z0-9]*\.)*bbb.edu/
>>> ......
>>>
>>> Good luck!
>>>
>>> yanky
>>>
>>> 2009/3/3 Tony Wang <[email protected]>
>>>
>>>> Can someone on this list give me some instructions about how to crawl
>>>> multiple websites in each run? Should I make a list of websites in the
>>>> urls folder? But how do I set up crawl-urlfilter.txt?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Are you RCholic? www.RCholic.com
>>>> 温 良 恭 俭 让 仁 义 礼 智 信
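For the "crawl" command route that yanky describes, a minimal setup might look like the sketch below. The seed file name, the filter lines, and the -depth/-topN values are only examples; adjust them for your own hosts and for the default rules already in conf/crawl-urlfilter.txt:

  # urls/seed.txt -- one seed URL per line
  http://www.aaa.edu/
  http://www.bbb.edu/

  # conf/crawl-urlfilter.txt -- accept only the listed domains
  +^http://([a-z0-9]*\.)*aaa.edu/
  +^http://([a-z0-9]*\.)*bbb.edu/
  # keep the final rule that skips everything else
  -.

  # run one crawl over all seeds
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

Every URL in urls/seed.txt is injected in the same run, so all of the listed sites are crawled together as long as each one matches an accept rule in crawl-urlfilter.txt.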
