Hello,

Thanks for the reply, but this doesn't seem to work either. I removed the
crawl dir, removed the regex I had in regex-urlfilter.txt and
crawl-urlfilter.txt, added the one you posted, and restarted the crawl. My
crawls still spend about 90% of their time on who.int. I have no idea how
to keep this domain, or all .int domains, from being crawled. Do I have the
regex in the wrong conf file?
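
One thing that can be checked outside of Nutch is whether the patterns
match at all, and whether rule order could be the problem. The comments in
the stock regex-urlfilter.txt say that the first matching pattern in the
file decides whether a URL is included or ignored. The sketch below is only
a simplified model of that behaviour written with Python's re module, not
Nutch's own code; the helper first_match_decision, the two rule lists and
the example.org URL are made up for illustration.

```python
import re

# Exclude patterns from this thread, without the leading "-" of the rule.
WHO_INT = r"^http://([a-z0-9]*\.)*who.int/"
ANY_INT = r"^http://[^/]*\.int/"


def first_match_decision(rules, url):
    """Accept (True) or reject (False) a URL using the first rule whose
    regex is found in it; reject when no rule matches at all."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


# Hypothetical rule files: the same two rules in a different order.
exclude_first = [("-", ANY_INT), ("+", r".")]
accept_first = [("+", r"."), ("-", ANY_INT)]

urls = [
    "http://www.who.int/en/",
    "http://who.int/",
    "http://example.org/page.int/",  # ".int" in the path, not the host
]

for url in urls:
    print(
        url,
        "exclude-first:", first_match_decision(exclude_first, url),
        "accept-first:", first_match_decision(accept_first, url),
    )
```

If this model holds, both patterns match the who.int URLs, and the exclude
rule only has an effect when it sits above a catch-all "+." line. Note that
both patterns are anchored on "http://", so an https URL would slip past
them.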

Thanks, 

-Warren

reinhard schwab wrote:
> 
> opsec wrote:
>> I've added this to my conf/crawl-urlfilter.txt and
>> conf/regex-urlfilter.txt, yet when I start a crawl this domain is
>> heavily spidered. I would like to remove it from my search results
>> entirely and prevent it, and possibly all *.int TLDs, from being
>> crawled in the future. How can I accomplish this?
>>
>> -^http://([a-z0-9]*\.)*who.int/
>>   
> why not
> 
> -^http://[^/]*\.int/
> 
> 
> 
>> Thanks for your time and any assistance, 
>>
>> -Warren
>>   
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/How-do-I-block-ban-a-specific-domain-name-or-a-tld--tp26289091p26306461.html
Sent from the Nutch - User mailing list archive at Nabble.com.