Re: Updated Domain URLFilter

Dennis Kubes Sat, 13 Dec 2008 16:18:54 -0800


ML mail wrote:

Thank you very much for this patch, this is great news and I am looking forward 
to using it.

Before that, I have a small question: I already have an index of around 1 million pages 
built with Nutch 0.9, and I would first like to remove all pages that are not in the 
".be" domain from this index. Is this possible, and if so, how? Or would it be 
better to start over from scratch with a new index using Nutch 1.0?

Regards

I think you can use a regex filter with the correct regex, and then do a crawldb merge using only the single crawldb. Let me know if you need more info.
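
As a rough illustration of what "the correct regex" could look like, here is a small Python sketch that keeps only URLs whose host ends in ".be". The pattern and the function names (keep_url, filter_urls) are assumptions made for this example, not something taken from Nutch. In Nutch itself the equivalent rule would presumably go into regex-urlfilter.txt as a "+" accept line followed by a "-." catch-all, and the merge step would be run with filtering enabled (e.g. bin/nutch mergedb with the -filter option); please check both against your Nutch version.

```python
import re

# Hypothetical pattern: accept http(s) URLs whose host ends in ".be".
# The host must be followed by an optional port and then "/", "?", "#",
# or the end of the string, so "example.be.evil.com" is not accepted.
BE_HOST = re.compile(
    r"^https?://([a-z0-9-]+\.)*[a-z0-9-]+\.be(:\d+)?([/?#]|$)",
    re.IGNORECASE,
)


def keep_url(url):
    """Return True if the URL's host is under the .be suffix."""
    return BE_HOST.match(url) is not None


def filter_urls(urls):
    """Keep only the URLs accepted by keep_url, preserving order."""
    return [u for u in urls if keep_url(u)]


if __name__ == "__main__":
    sample = [
        "http://www.example.be/index.html",
        "http://example.com/page.be",
        "https://example.be",
    ]
    for url in filter_urls(sample):
        print(url)
```

Testing the pattern on a handful of URLs like this before running the merge is cheap, since a wrong regex would silently drop pages from the crawldb.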


Dennis




--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:

From: Dennis Kubes <[email protected]>
Subject: Updated Domain URLFilter
To: [email protected]
Date: Saturday, December 13, 2008, 8:57 AM
An updated patch has been added for the domain urlfilter. This now includes the matching against domain suffix, domain
name, and hostname in that order.
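
The matching order can be sketched roughly as follows. This is a simplified Python illustration, not the patch itself: the function name domain_filter_accepts is hypothetical, and it naively treats the last label as the domain suffix and the last two labels as the domain name, which is an assumption that does not hold for multi-part suffixes such as "co.uk".

```python
def domain_filter_accepts(host, allowed):
    """Check a host against a set of allowed entries.

    Candidates are tried in the order: domain suffix, domain name, hostname.
    Assumption: suffix = last label, domain name = last two labels.
    """
    host = host.lower().rstrip(".")
    labels = host.split(".")
    suffix = labels[-1]
    domain = ".".join(labels[-2:])
    for candidate in (suffix, domain, host):
        if candidate in allowed:
            return True
    return False


if __name__ == "__main__":
    allowed = {"be", "apache.org", "www.example.com"}
    for h in ["www.example.be", "lucene.apache.org", "mail.example.com"]:
        print(h, domain_filter_accepts(h, allowed))
```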

Dennis
