Hi Tony,

You could check the "Step-by-Step or Whole-web Crawling" section from
http://wiki.apache.org/nutch/NutchTutorial
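
The idea there is that you drive the fetch loop yourself instead of using
the one-shot "crawl" command, so outlinks to external hosts are followed as
far as your URL filters and the number of rounds allow. Very roughly, the
loop looks like this (the directory names are just the ones the tutorial
uses; check it for the exact, current syntax):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1
  # repeat generate/fetch/updatedb for as many rounds as you want, then:
  bin/nutch invertlinks crawl/linkdb crawl/segments/*
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*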

Justin

Tony Wang wrote:
> Thanks Justin. But I would still like to index pages from external hosts.
> I have a target site for crawling; it's an online directory that is rich
> in the resources I want. I would like to have my Nutch crawl the
> directory and then follow its outlinks to external pages for further
> crawling.
> 
> How can I achieve this while still crawling multiple sites?
> 
> Thanks a lot!
> 
> Tony
> 
> 2009/3/3 Justin Yao <[email protected]>
> 
>> Another workaround is to set "db.ignore.external.links" to "true" in
>> your nutch-site.xml:
>>
>> <property>
>>  <name>db.ignore.external.links</name>
>>  <value>true</value>
>>  <description>If true, outlinks leading from a page to external hosts
>>  will be ignored. This is an effective way to limit the crawl to
>> include only initially injected hosts, without creating complex URLFilters.
>>  </description>
>> </property>
>>
>>
>> Tony Wang wrote:
>>> that helps a lot! thanks!
>>>
>>> 2009/3/2 yanky young <[email protected]>
>>>
>>>> Hi:
>>>>
>>>> I'm not a Nutch expert, but I think your problem is easy to solve.
>>>>
>>>> 1. make a list of seed URLs in a file under the urls folder (see the
>>>> example seed list after the filter rules below)
>>>> 2. add all of the domains that you want to crawl to crawl-urlfilter.txt,
>>>> like this:
>>>>
>>>> # accept hosts in aaa.edu and bbb.edu
>>>> +^http://([a-z0-9]*\.)*aaa\.edu/
>>>> +^http://([a-z0-9]*\.)*bbb\.edu/
>>>> ......
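>>>>
>>>> for step 1, the seed file is just a plain text file with one URL per
>>>> line, e.g. a file like urls/seed.txt (any name works) containing:
>>>>
>>>> http://www.aaa.edu/
>>>> http://www.bbb.edu/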
>>>>
>>>> good luck!
>>>>
>>>> yanky
>>>>
>>>> 2009/3/3 Tony Wang <[email protected]>
>>>>
>>>>> Can someone on this list give me some instructions on how to crawl
>>>>> multiple websites in each run? Should I make a list of websites in the
>>>>> urls folder? And how do I set up crawl-urlfilter.txt?
>>>>>
>>>>> thanks!
>>>>>
>>>>> --
>>>>> Are you RCholic? www.RCholic.com
>>>>> Gentleness, kindness, courtesy, frugality, humility; benevolence,
>>>>> righteousness, propriety, wisdom, trustworthiness
>>>>>
>>>
>>>
> 
> 
> 
