ML mail wrote:
Thank you very much for this patch. This is great news, and I am looking forward
to using it.
Before I do, I have a small question: I already have an index of around 1 million pages,
built with Nutch 0.9, and would first like to remove from it all pages whose hostnames
do not end with ".be". Is this possible, and if so, how? Or would it be
better to start over from scratch with a new index using Nutch 1.0?
Regards
I think you can use the regex URL filter with a regex that accepts only the
URLs you want to keep, and then run a crawldb merge with that single crawldb
as its only input. Let me know if you need more info.
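
To make the filtering idea concrete, here is a minimal Python sketch of how a
"+"/"-" rule list in the style of regex-urlfilter.txt could keep only ".be"
hosts. It is an illustration, not Nutch code: the rule list, the accept()
helper, and the sample URLs are all hypothetical, and it assumes the first
matching rule decides and that hostnames are already lowercase. The merge
step would then be run with URL filtering enabled (as far as I know,
bin/nutch mergedb takes a -filter option; check the usage output of your
version).

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt:
# "+" keeps a matching URL, "-" drops it; the first matching rule decides.
RULES = [
    # Keep URLs whose hostname ends in ".be" (optional port, then "/" or end).
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*be(:\d+)?(/|$)")),
    # Drop everything else.
    ("-", re.compile(r".")),
]


def accept(url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption of this sketch: a URL that matches no rule is dropped.
    return False


if __name__ == "__main__":
    # Made-up sample URLs, only for illustration.
    samples = [
        "http://www.example.be/page.html",
        "http://www.example.com/be/",
        "http://be.example.com/",
    ]
    for url in samples:
        print(url, "keep" if accept(url) else "drop")
```

The catch-all "-" rule at the end makes the intent explicit: anything not
accepted by the ".be" rule is removed.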
Dennis
--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:
From: Dennis Kubes <[email protected]>
Subject: Updated Domain URLFilter
To: [email protected]
Date: Saturday, December 13, 2008, 8:57 AM
An updated patch has been added for the domain urlfilter.
It now matches against the domain suffix, the domain
name, and the hostname, in that order.
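
The matching order can be sketched in Python as follows. This is not the
patch's code: the function names and the sample allow-list are hypothetical,
and the sketch assumes, as a simplification, that the suffix is the last
label of the hostname and the domain name is the last two labels (which is
wrong for multi-part suffixes such as "co.uk").

```python
def candidates(host):
    """Return (kind, value) pairs in the order they are checked."""
    host = host.lower().rstrip(".")
    labels = host.split(".")
    suffix = labels[-1]              # simplification: last label only
    domain = ".".join(labels[-2:])   # simplification: last two labels
    return [("suffix", suffix), ("domain", domain), ("hostname", host)]


def match(host, allowed):
    """Return which kind of entry matched first, or None if nothing did."""
    for kind, value in candidates(host):
        if value in allowed:
            return kind
    return None


if __name__ == "__main__":
    # Hypothetical allow-list mixing all three kinds of entries.
    allowed = {"be", "apache.org", "www.example.com"}
    for host in ["www.nutch.be", "lucene.apache.org",
                 "www.example.com", "mail.example.com"]:
        print(host, match(host, allowed))
```

With an allow-list containing only "be", this accepts every ".be" host and
rejects all others.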
Dennis