If you want to use the "crawl" command, you have to set up "crawl-urlfilter.txt".
If you want to take advantage of "db.ignore.external.links", you have to
follow the "Step-by-Step or Whole-web Crawling" process described in
http://wiki.apache.org/nutch/NutchTutorial
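
For the step-by-step route, one crawl round looks roughly like the sketch
below. This assumes a Nutch 0.9/1.0-style setup with the seed list in a
"urls" directory and output under "crawl"; those directory names are just
placeholders I picked, not anything Nutch requires:

# put the seed URLs into a fresh crawldb
bin/nutch inject crawl/crawldb urls
# generate a fetch list and pick up the newest segment
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
# fetch that segment, then fold the results back into the crawldb
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1
# once enough rounds are done, build the linkdb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Repeat generate/fetch/updatedb for as many rounds of depth as you want, then
index. With "db.ignore.external.links" set to "true", every round stays on
the hosts you injected, so no URLFilter tuning is needed for that.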


Justin Yao wrote:
> Another workaround is to set "db.ignore.external.links" to "true" in
> your nutch-site.xml:
> 
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to
>   include only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
> 
> 
> Tony Wang wrote:
>> that helps a lot! thanks!
>>
>> 2009/3/2 yanky young <[email protected]>
>>
>>> Hi:
>>>
>>> I am not a Nutch expert, but I think your problem is easy.
>>>
>>> 1. make a list of seed urls in a file under urls folder
>>> 2. add every domain that you want to crawl to crawl-urlfilter.txt, just
>>> like this:
>>>
>>> # accept hosts in MY.DOMAIN.NAME
>>> +^http://([a-z0-9]*\.)*aaa\.edu/
>>> +^http://([a-z0-9]*\.)*bbb\.edu/
>>> ......
>>>
>>> good luck!
>>>
>>> yanky
>>>
>>> 2009/3/3 Tony Wang <[email protected]>
>>>
>>>> Can someone on this list give me some instructions about how to crawl
>>>> multiple websites in each run? Should I make a list of websites in the
>>> urls
>>>> folder? but how to set up the crawl-urlfilter.txt?
>>>>
>>>> thanks!
>>>>
>>>> --
>>>> Are you RCholic? www.RCholic.com
>>>> 温 良 恭 俭 让 仁 义 礼 智 信
>>>>
>>
>>
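
To make yanky's steps above concrete (the file name and the -depth/-topN
values below are just placeholders for illustration): the seed list is a
plain text file with one URL per line, for example urls/seed.txt:

http://www.aaa.edu/
http://www.bbb.edu/

With the matching "+^http://..." lines in crawl-urlfilter.txt, a single
one-step run over all of the listed sites is then something like:

bin/nutch crawl urls -dir crawl -depth 3 -topN 1000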
