Thanks Justin. But I would still like to index pages from external hosts.
I have a target site for crawling: it is an online directory, rich in the
resources I want. I would like Nutch to crawl the directory and then
follow the outlinks to the external pages for further crawling.

How can I achieve this while still crawling multiple sites in one run?
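
Here is a rough sketch of what I think I need (the hostnames below are
only placeholders, so please correct me if this is wrong): leave
db.ignore.external.links at its default of false so outlinks to external
hosts are followed, and whitelist the hosts I care about in
regex-urlfilter.txt instead:

<property>
 <name>db.ignore.external.links</name>
 <value>false</value>
 <description>Keep following outlinks to external hosts.</description>
</property>

# regex-urlfilter.txt: accept the directory site and the external
# sites it links out to, reject everything else
+^http://www\.directory-example\.com/
+^http://([a-z0-9]*\.)*external-example\.com/
-.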

Thanks a lot!

Tony

2009/3/3 Justin Yao <[email protected]>

> Another workaround is to set "db.ignore.external.links" to "true" in
> your nutch-site.xml:
>
> <property>
>  <name>db.ignore.external.links</name>
>  <value>true</value>
>  <description>If true, outlinks leading from a page to external hosts
>  will be ignored. This is an effective way to limit the crawl to
> include only initially injected hosts, without creating complex URLFilters.
>  </description>
> </property>
>
>
> Tony Wang wrote:
> > that helps a lot! thanks!
> >
> > 2009/3/2 yanky young <[email protected]>
> >
> >> Hi:
> >>
> >> I am not a Nutch expert, but I think your problem is easy to solve.
> >>
> >> 1. make a list of seed URLs in a file under the urls folder (there is
> >> a sample seed file after the filter rules below)
> >> 2. add all of the domains that you want to crawl to
> >> crawl-urlfilter.txt, like this:
> >>
> >> # accept hosts in MY.DOMAIN.NAME
> >> +^http://([a-z0-9]*\.)*aaa\.edu/
> >> +^http://([a-z0-9]*\.)*bbb\.edu/
> >> ......
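> >>
> >> as the sample for step 1 (these domain names are placeholders too), a
> >> seed file such as urls/seeds.txt could simply list one URL per line:
> >>
> >> http://www.aaa.edu/
> >> http://www.bbb.edu/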
> >>
> >> good luck!
> >>
> >> yanky
> >>
> >> 2009/3/3 Tony Wang <[email protected]>
> >>
> >>> Can someone on this list give me some instructions on how to crawl
> >>> multiple websites in each run? Should I make a list of websites in
> >>> the urls folder? And how should I set up crawl-urlfilter.txt?
> >>>
> >>> thanks!
> >>>
> >>> --
> >>> Are you RCholic? www.RCholic.com
> >>> Warmth, kindness, respect, frugality, humility; benevolence, righteousness, propriety, wisdom, trustworthiness
> >>>
> >
> >
> >
>



-- 
Are you RCholic? www.RCholic.com
Warmth, kindness, respect, frugality, humility; benevolence, righteousness, propriety, wisdom, trustworthiness
