Thanks for the good tips, yanky!
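For anyone else landing on this thread, here is a rough sketch of how the pieces fit together in a single run. The seed file name (urls/seed.txt) and the depth/topN values are just example choices on my part, not anything from the docs:

# urls/seed.txt -- one seed URL per line; Nutch injects every URL listed here
http://www.aaa.edu/
http://www.bbb.edu/

# then crawl all seeds in one run (depth and topN are example values)
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

With db.ignore.external.links set to true, outlinks to other hosts are dropped, so the crawl stays on the injected sites even without extra crawl-urlfilter.txt rules.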
2009/3/4 Justin Yao <[email protected]>

> Another workaround is to set "db.ignore.external.links" to "true" in
> your nutch-site.xml:
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to
>   include only initially injected hosts, without creating complex
>   URLFilters.
>   </description>
> </property>
>
>
> Tony Wang wrote:
> > that helps a lot! thanks!
> >
> > 2009/3/2 yanky young <[email protected]>
> >
> >> Hi:
> >>
> >> I am not a Nutch expert, but I think your problem is easy.
> >>
> >> 1. make a list of seed urls in a file under the urls folder
> >> 2. add all of the domains that you want to crawl to
> >> crawl-urlfilter.txt, just like this:
> >>
> >> # accept hosts in MY.DOMAIN.NAME
> >> +^http://([a-z0-9]*\.)*aaa\.edu/
> >> +^http://([a-z0-9]*\.)*bbb\.edu/
> >> ......
> >>
> >> good luck!
> >>
> >> yanky
> >>
> >> 2009/3/3 Tony Wang <[email protected]>
> >>
> >>> Can someone on this list give me some instructions about how to crawl
> >>> multiple websites in each run? Should I make a list of websites in
> >>> the urls folder? But how do I set up the crawl-urlfilter.txt?
> >>>
> >>> thanks!
> >>>
> >>> --
> >>> Are you RCholic? www.RCholic.com
> >>> 温 良 恭 俭 让 仁 义 礼 智 信
