Hi Tony,

You could check the "Step-by-Step or Whole-web Crawling" section of
http://wiki.apache.org/nutch/NutchTutorial
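In case it helps, the step-by-step flow from that section looks roughly like
the sketch below (this assumes a 0.9/1.0-style install with a seed list under
urls/ and output under crawl/; the index step in particular differs in newer
versions):

  # one generate/fetch/update cycle; rerun it to go one link level deeper
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1

  # after the last cycle, build the link database and the index
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Because you decide how many generate/fetch/update rounds to run, this lets the
crawl follow outlinks to external hosts while still keeping its depth bounded.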
Justin

Tony Wang wrote:
> Thanks Justin. But I would still like to index pages from an external host.
> I have a target site for crawling, and that site is an online directory
> rich in resources I want. I would like to have my nutch crawl the
> directory and then go to the external pages for further crawling.
>
> I wonder how to achieve this while still allowing multiple-site crawling?
>
> Thanks a lot!
>
> Tony
>
> 2009/3/3 Justin Yao <[email protected]>
>
>> Another workaround is to set "db.ignore.external.links" to "true" in
>> your nutch-site.xml:
>>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>>   <description>If true, outlinks leading from a page to external hosts
>>   will be ignored. This is an effective way to limit the crawl to
>>   include only initially injected hosts, without creating complex URLFilters.
>>   </description>
>> </property>
>>
>>
>> Tony Wang wrote:
>>> that helps a lot! thanks!
>>>
>>> 2009/3/2 yanky young <[email protected]>
>>>
>>>> Hi:
>>>>
>>>> I am not a nutch expert, but I think your problem is easy to solve.
>>>>
>>>> 1. make a list of seed urls in a file under the urls folder
>>>> 2. add all of the domains that you want to crawl to crawl-urlfilter.txt,
>>>> like this:
>>>>
>>>> # accept hosts in MY.DOMAIN.NAME
>>>> +^http://([a-z0-9]*\.)*aaa.edu/
>>>> +^http://([a-z0-9]*\.)*bbb.edu/
>>>> ......
>>>>
>>>> good luck!
>>>>
>>>> yanky
>>>>
>>>> 2009/3/3 Tony Wang <[email protected]>
>>>>
>>>>> Can someone on this list give me some instructions on how to crawl
>>>>> multiple websites in each run? Should I make a list of websites in the
>>>>> urls folder? But how do I set up crawl-urlfilter.txt?
>>>>>
>>>>> thanks!
>>>>>
>>>>> --
>>>>> Are you RCholic? www.RCholic.com
>>>>> 温 良 恭 俭 让 仁 义 礼 智 信
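Putting the quoted suggestions together, a minimal multi-site setup looks
roughly like this (the seeds.txt file name, the -depth/-topN values, and the
aaa.edu/bbb.edu domains are only placeholders):

  # urls/seeds.txt - one seed URL per line
  http://www.aaa.edu/
  http://www.bbb.edu/

  # conf/crawl-urlfilter.txt - accept the hosts you want, reject the rest
  +^http://([a-z0-9]*\.)*aaa.edu/
  +^http://([a-z0-9]*\.)*bbb.edu/
  -.

  # one-shot crawl over the seed directory
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

If you do want the crawl to follow outlinks to external hosts, as Tony
described, you would have to loosen or drop those host filters and rely on the
crawl depth (or the step-by-step cycles above) to keep the crawl bounded.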
