Another workaround is to set "db.ignore.external.links" to "true" in
your nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to
  include only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
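
With that property set to true, the crawl stays on the hosts you
injected, whatever the seed list contains. For example (a sketch
assuming the Nutch 0.9/1.x one-step crawl command; directory names
are just placeholders):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

where the urls folder holds the seed list and crawl is the output
directory.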


Tony Wang wrote:
> that helps a lot! thanks!
> 
> 2009/3/2 yanky young <[email protected]>
> 
>> Hi:
>>
>> I am not a Nutch expert, but I think your problem is easy.
>>
>> 1. make a list of seed urls in a file under the urls folder (see the
>> example below)
>> 2. add all of the domains that you want to crawl to
>> crawl-urlfilter.txt, like this:
>>
>> # accept hosts in the domains you want to crawl
>> +^http://([a-z0-9]*\.)*aaa\.edu/
>> +^http://([a-z0-9]*\.)*bbb\.edu/
>> ......
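>>
>> a seed file under the urls folder might then look like this (just a
>> sketch; the file name does not matter, inject reads every file in
>> that folder):
>>
>> http://www.aaa.edu/
>> http://www.bbb.edu/
>> ......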
>>
>> good luck!
>>
>> yanky
>>
>> 2009/3/3 Tony Wang <[email protected]>
>>
>>> Can someone on this list give me some instructions about how to crawl
>>> multiple websites in each run? Should I make a list of websites in
>>> the urls folder? And how should I set up crawl-urlfilter.txt?
>>>
>>> thanks!
>>>
>>> --
>>> Are you RCholic? www.RCholic.com
>>> Gentleness, kindness, courtesy, temperance, deference; benevolence,
>>> righteousness, propriety, wisdom, trustworthiness
>>>