Another workaround is to set "db.ignore.external.links" to "true" in your nutch-site.xml (the default is "false"):
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

Tony Wang wrote:
> that helps a lot! thanks!
>
> 2009/3/2 yanky young <[email protected]>
>
>> Hi:
>>
>> I am not a Nutch expert, but I think your problem is easy to solve.
>>
>> 1. Make a list of seed URLs in a file under the urls folder.
>> 2. Add every domain you want to crawl to crawl-urlfilter.txt, like
>> this:
>>
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://([a-z0-9]*\.)*aaa.edu/
>> +^http://([a-z0-9]*\.)*bbb.edu/
>> ......
>>
>> good luck!
>>
>> yanky
>>
>> 2009/3/3 Tony Wang <[email protected]>
>>
>>> Can someone on this list give me some instructions on how to crawl
>>> multiple websites in each run? Should I make a list of websites in
>>> the urls folder? And how should I set up crawl-urlfilter.txt?
>>>
>>> thanks!
>>>
>>> --
>>> Are you RCholic? www.RCholic.com
>>> Gentleness, kindness, respect, frugality, modesty; benevolence,
>>> righteousness, propriety, wisdom, fidelity
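To put the pieces above together, here is a minimal sketch of the setup for crawling several sites in one run, assuming the classic Nutch 0.9/1.x layout; the file name urls/seed.txt and the -depth/-topN values are illustrative, not prescriptive:

# urls/seed.txt -- one seed URL per line
http://www.aaa.edu/
http://www.bbb.edu/

# conf/crawl-urlfilter.txt -- accept only the listed hosts
+^http://([a-z0-9]*\.)*aaa.edu/
+^http://([a-z0-9]*\.)*bbb.edu/
# skip everything else
-.

# one-step crawl command over all seeds
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

Note that with db.ignore.external.links set to true, the per-host filter rules become largely optional: external outlinks are dropped anyway, so the crawl stays on the injected hosts either way.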
