hi Vince, have you tried this property?
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

HTH,
Renaud

Vince Filby wrote:
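Note that the snippet above is the default from nutch-default.xml, where the value is false. To actually get the behavior described, the property would typically be overridden with value true in conf/nutch-site.xml (a sketch, assuming the standard Nutch configuration layout):

```xml
<!-- conf/nutch-site.xml: override to keep the crawl on the injected hosts -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```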
Hello,

Is there a way to tell Nutch to follow only links within the domain it is currently crawling? What I would like to do is pass in a list of URLs and have Nutch ignore all outbound links to any domain other than the one the link came from. Let's say I am crawling www.test1.com; it should only follow links to www.test1.com.

I realize I can do this with the regex filter *if* I add a regex rule for *each* site I want to crawl, but that solution doesn't scale well for my project. I have also read about a db-based URL filter that maintains a list of accepted URLs in a database. That doesn't fit well either, since I don't want to maintain both the crawl list and the accepted-domain database. I could, but it would be rather clunky.

I have poked around the source, and it looks like the URL filtering mechanism is only passed the link URL and returns a URL. So it appears this is not really possible at the code level without source modifications. I would just like to confirm that I am not missing anything obvious before I start reworking the code.

Cheers,
Vince
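For anyone following along, the limitation Vince describes comes from the shape of the filter hook itself: a Nutch URLFilter is handed only the candidate URL string and returns it (to keep) or null (to drop). The sketch below is a self-contained approximation of that interface, not the actual Nutch source; the class and names are hypothetical. It shows why a filter can reject a fixed host list but cannot express "same host as the source page", since the source page is never passed in.

```java
// Sketch of why a plain URLFilter can't do per-source-host filtering.
// The real extension point is org.apache.nutch.net.URLFilter; this is
// a self-contained approximation for illustration only.
public class UrlFilterSketch {

    // The filter sees only the outlink URL -- there is no parameter
    // identifying the page the link was found on.
    interface URLFilter {
        String filter(String urlString); // return the URL to keep it, null to drop it
    }

    // A filter hard-wired to a single host. Generalizing this to
    // "same host as the originating page" is impossible from inside
    // filter(), because the originating page is not an argument.
    static final URLFilter ONLY_TEST1 = url ->
        url.startsWith("http://www.test1.com/") ? url : null;

    public static void main(String[] args) {
        System.out.println(ONLY_TEST1.filter("http://www.test1.com/a.html")); // kept
        System.out.println(ONLY_TEST1.filter("http://www.other.com/b.html")); // null, dropped
    }
}
```

This is why db.ignore.external.links lives in the crawl-db update logic rather than in a URLFilter plugin: that code path knows both ends of the link.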
