Dear Dennis,

Just to let you know that I have now had time to test Nutch 1.0-dev with the Domain URLFilter patch. To index only domains under the .be TLD, I added "be" to the domain-urlfilter.txt file in Nutch's conf directory. I ran some test crawls of up to around 400'000 pages, and unfortunately I keep seeing some .com domains, for example www.adobe.com, which has nothing to do with ".be". A search for .com returns around 40'000 pages that end with .com. Is there some extra configuration I need to do in order to get only .be websites indexed?
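To make the expected behaviour concrete, here is a minimal Python sketch of the matching order described in the patch announcement quoted at the end of this message (domain suffix, then domain name, then hostname). It is not the plugin's code: the function name `host_matches` and the sample URLs are made up for illustration, and it assumes the suffix is simply the last label of the host and the domain the last two labels, which ignores multi-part suffixes such as co.uk.

```python
from urllib.parse import urlparse


def host_matches(url, allowed):
    """Return True if the URL's host matches an entry in `allowed`.

    Checks, in order: domain suffix, domain name, full hostname.
    Simplification (assumption): suffix = last label, domain = last two labels.
    """
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False
    labels = host.split(".")
    suffix = labels[-1]
    domain = ".".join(labels[-2:])
    return suffix in allowed or domain in allowed or host in allowed


# Entries as they would appear, one per line, in domain-urlfilter.txt
allowed = {"be"}

# Hypothetical sample URLs
urls = ["http://www.adobe.com/", "http://www.example.be/page"]
kept = [u for u in urls if host_matches(u, allowed)]
print(kept)
```

With only "be" in the allowed set, a host such as www.adobe.com fails all three checks, so under this reading of the filter it should not be indexed.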
Also, another question: I have noticed that this crawl of around 400'000 pages currently occupies 57 GB of disk space, with the segments directory taking up almost all of it. Our old Nutch 0.9 installation, with 1'000'000 pages crawled, occupies only 13 GB. What is the difference between Nutch 0.9 and Nutch 1.0-dev that explains this large gap in disk usage?

Best regards

--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:

> From: Dennis Kubes <[email protected]>
> Subject: Updated Domain URLFilter
> To: [email protected]
> Date: Saturday, December 13, 2008, 8:57 AM
>
> An updated patch has been added for the domain urlfilter.
> This now includes the matching against domain suffix, domain
> name, and hostname in that order.
>
> Dennis
