ML mail wrote:
Thank you very much for this patch, this is great news and I am looking forward 
to using it.

Before that, I have a small question: I already have an index of around 1 million pages 
built with Nutch 0.9, and I would first like to remove from it all pages whose domain 
does not end with ".be". Is this possible, and if so, how? Or would it be 
better to start over from scratch with a new index using Nutch 1.0?

Regards

I think you can use a regex URL filter with a regex that accepts only ".be" URLs, and then run a crawldb merge with that filter applied, passing your single crawldb as the only input. Let me know if you need more info.

Dennis
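
To make the filtering idea above concrete, here is a minimal sketch in Python of how a ".be"-only rule list might behave. It assumes the usual regex-urlfilter semantics: each rule is a "+" (accept) or "-" (reject) sign followed by a regex, the first matching rule decides, and a URL that matches no rule is dropped. The regex, the RULES list and the keep() helper are illustrative assumptions, not part of Nutch or of the patch. The two rules would correspond to lines like "+^https?://([a-z0-9-]+\.)+be(:\d+)?(/|$)" and "-." in the filter file, and the merge itself would be run separately, presumably with the mergedb command and its filter option.

```python
import re

# Hypothetical rules in the style of a Nutch regex URL filter file:
# '+' accepts, '-' rejects, and the first rule that matches wins.
RULES = [
    # Accept URLs whose host ends in ".be" (optional port, then path or end).
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)+be(:\d+)?(/|$)")),
    # Reject everything else.
    ("-", re.compile(r".")),
]


def keep(url):
    """Return True if the first matching rule accepts the URL."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule is filtered out.
    return False


if __name__ == "__main__":
    samples = [
        "http://www.example.be/page.html",
        "http://www.example.com/index.be",
        "http://example.be.example.com/",
    ]
    for url in samples:
        print(url, "keep" if keep(url) else "drop")
```

Note that the pattern is anchored on the host part, so a URL such as "http://www.example.com/index.be" is dropped even though its path ends in ".be".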




--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:

From: Dennis Kubes <[email protected]>
Subject: Updated Domain URLFilter
To: [email protected]
Date: Saturday, December 13, 2008, 8:57 AM
An updated patch has been added for the domain urlfilter. It now matches against the domain suffix, domain
name, and hostname, in that order.

Dennis

