Re: Updated Domain URLFilter

ML mail Tue, 23 Dec 2008 06:23:18 -0800

Dear Dennis

Just to let you know that I have now had time to test Nutch 1.0-dev with the
Domain URLFilter patch. To index only domains under the .be TLD, I added "be"
to the domain-urlfilter.txt file in Nutch's conf directory. I ran some test
crawls of up to around 400'000 pages and unfortunately I keep seeing some .com
domains, for example www.adobe.com, which has nothing to do with ".be". If I
search for .com, there are around 40'000 pages that end with .com. Is there
some extra configuration I need to do in order to get only .be websites
indexed?
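
To check my understanding of the matching you described (domain suffix, then
domain name, then hostname), here is a minimal Python sketch. It is not the
Nutch code: the function names load_entries and accept are made up for this
example, and treating the suffix as the last host label and the domain name as
the last two labels is a simplifying assumption that ignores multi-part
suffixes such as co.uk. Skipping "#" comment lines in the file is also an
assumption.

```python
from urllib.parse import urlparse


def load_entries(text):
    """Parse filter-file text: one entry per line.

    Blank lines and lines starting with '#' are skipped (assumed format).
    """
    entries = set()
    for line in text.splitlines():
        line = line.strip().lower()
        if line and not line.startswith("#"):
            entries.add(line)
    return entries


def accept(url, entries):
    """Return True if the URL's suffix, domain name, or hostname is listed."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False
    labels = host.split(".")
    suffix = labels[-1]             # simplification: last label only
    domain = ".".join(labels[-2:])  # simplification: last two labels
    # Check in the order described: suffix, then domain name, then hostname.
    for candidate in (suffix, domain, host):
        if candidate in entries:
            return True
    return False


entries = load_entries("# only Belgian sites\nbe\n")
print(accept("http://www.adobe.com/", entries))
print(accept("http://www.example.be/page.html", entries))
```

With only "be" in the file, this logic rejects www.adobe.com and accepts a
host under .be, which is the behaviour I was expecting from the filter.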

I also have another question. I have noticed that this crawl of around 400'000
pages currently occupies 57 GB of disk space (the segments directory takes up
almost all of it), whereas our old Nutch 0.9 installation with 1'000'000 pages
crawled occupies only 13 GB. That is roughly 140 KB per page against about
13 KB per page. What is the difference between Nutch 0.9 and Nutch 1.0-dev
that explains this large difference in disk usage?

Best regards

--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:

> From: Dennis Kubes <[email protected]>
> Subject: Updated Domain URLFilter
> To: [email protected]
> Date: Saturday, December 13, 2008, 8:57 AM
> An updated patch has been added for the domain urlfilter. 
> This now includes the matching against domain suffix, domain
> name, and hostname in that order.
> 
> Dennis
