Thanks Justin. But I would still like to index pages from external hosts. My target site for crawling is an online directory that is rich in the resources I want, so I would like to have Nutch crawl the directory and then follow its outlinks to the external pages for further crawling.
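To make it concrete, here is the kind of setup I am picturing — the host names are just placeholders, with directory.example.org standing in for my directory site and the others for external sites it links to. In nutch-site.xml I would keep external links enabled:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <!-- false (the default) means outlinks to external hosts are followed -->
</property>

and then bound the crawl in crawl-urlfilter.txt:

# accept the directory plus the external hosts I care about
+^http://([a-z0-9]*\.)*directory.example.org/
+^http://([a-z0-9]*\.)*external-one.example.com/
+^http://([a-z0-9]*\.)*external-two.example.net/
# skip everything else
-.

But listing every external host this way does not seem practical, since the directory links out to many sites I cannot enumerate in advance.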
I wonder how to achieve this while still allowing multiple-site crawling? Thanks a lot!

Tony

2009/3/3 Justin Yao <[email protected]>

> Another workaround is to set "db.ignore.external.links" to "true" in
> your nutch-site.xml:
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> Tony Wang wrote:
> > That helps a lot! Thanks!
> >
> > 2009/3/2 yanky young <[email protected]>
> >
> >> Hi:
> >>
> >> I am not a Nutch expert, but I think your problem is easy to solve.
> >>
> >> 1. Make a list of seed URLs in a file under the urls folder.
> >> 2. Add all of the domains that you want to crawl to
> >> crawl-urlfilter.txt, like this:
> >>
> >> # accept hosts in MY.DOMAIN.NAME
> >> +^http://([a-z0-9]*\.)*aaa.edu/
> >> +^http://([a-z0-9]*\.)*bbb.edu/
> >> ......
> >>
> >> Good luck!
> >>
> >> yanky
> >>
> >> 2009/3/3 Tony Wang <[email protected]>
> >>
> >>> Can someone on this list give me some instructions on how to crawl
> >>> multiple websites in each run? Should I make a list of websites in
> >>> the urls folder? But how do I set up crawl-urlfilter.txt?
> >>>
> >>> Thanks!
> >>>
> >>> --
> >>> Are you RCholic? www.RCholic.com
> >>> 温 良 恭 俭 让 仁 义 礼 智 信

--
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信
