ML mail wrote:
Dear Dennis

Just to let you know that I now had time to test Nutch 1.0-dev with the Domain URLFilter patch. In order to index only domains with the .be TLD, I added "be" to the domain-urlfilter.txt file in Nutch's conf directory. I did some test crawls of up to around 400'000 pages, and unfortunately I keep seeing some .com domains, for example www.adobe.com, which has nothing to do with ".be". If I search for .com, there are around 40'000 pages that end with .com. So is there maybe some extra configuration I need to do in order to get only .be websites indexed?


Weird. I added a unit test similar to what you described, and the .com URL was successfully excluded. I had a domain file like this:

        net
        apache.org
        be
        www.yahoo.com

And Unit tests like this:

        assertNotNull(domainFilter.filter("http://lucene.apache.org";));
        assertNotNull(domainFilter.filter("http://hadoop.apache.org";));
        assertNotNull(domainFilter.filter("http://www.apache.org";));
        assertNull(domainFilter.filter("http://www.google.com";));
        assertNull(domainFilter.filter("http://mail.yahoo.com";));
        assertNotNull(domainFilter.filter("http://www.foobar.net";));
        assertNotNull(domainFilter.filter("http://www.foobas.net";));
        assertNotNull(domainFilter.filter("http://www.yahoo.com";));
        assertNotNull(domainFilter.filter("http://www.foobar.be";));
        assertNull(domainFilter.filter("http://www.adobe.com";));

I am going to go ahead and commit this code. If we still see errors popping up, we will need to revisit it. Truthfully, though, I don't know how an error like the one you are describing could happen, since all this filter does is match the suffix, domain name, and hostname. There is no regex involved.
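To make that concrete, here is a rough sketch of what suffix/domain/host matching amounts to: plain string splitting and set lookups. This is only an illustration; the class and helper names are made up and it is not the actual DomainURLFilter source (real suffix handling, e.g. "co.uk", goes through Nutch's domain-suffixes table, while this sketch just takes the last label):

        // Illustration only -- hypothetical names, not the actual
        // DomainURLFilter source. Real public-suffix handling is more
        // involved; this sketch treats the last host label as the suffix.
        import java.net.URL;
        import java.util.HashSet;
        import java.util.Set;

        public class DomainFilterSketch {

          private final Set<String> domains = new HashSet<String>();

          public DomainFilterSketch(Set<String> configured) {
            // entries as they appear in domain-urlfilter.txt,
            // e.g. "be", "apache.org", "www.yahoo.com"
            domains.addAll(configured);
          }

          /** Returns the url if suffix, domain, or host matches; null otherwise. */
          public String filter(String url) {
            try {
              String host = new URL(url).getHost();          // "lucene.apache.org"
              String[] p = host.split("\\.");
              String suffix = p[p.length - 1];               // "org"
              String domain = p.length > 1
                  ? p[p.length - 2] + "." + suffix : suffix; // "apache.org"
              // Plain set lookups -- no regular expressions.
              return (domains.contains(suffix)
                   || domains.contains(domain)
                   || domains.contains(host)) ? url : null;
            } catch (Exception e) {
              return null;  // malformed URL: filter it out
            }
          }
        }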

Also, another question: I have noticed that this crawl of around 400'000 pages currently occupies 57 GB of space (the segments directory taking up most of it). But our old Nutch 0.9 crawl of 1'000'000 pages occupies only 13 GB. So I was wondering, what is the difference between Nutch 0.9 and Nutch 1.0-dev that explains this big difference in space usage?


I know there are changes in the CrawlDb structure and metadata, but I don't know why there would be that big of a difference unless you:

1) set the maximum content fetched higher,
2) were fetching a set of pages that simply had more content, or
3) set redirects to > 0, so redirects were followed and you were actually fetching many more pages (see the config sketch after this list for points 1 and 3).
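If it helps, these are the two properties I would compare between the 0.9 and 1.0-dev installs. This is only an illustrative conf/nutch-site.xml fragment; the values shown are the stock defaults, so anything different on your side could account for the extra fetched content:

        <!-- Illustrative overrides for conf/nutch-site.xml. The values
             shown are the stock defaults, so only a value changed on
             your side would explain the extra space. -->
        <property>
          <name>http.content.limit</name>
          <!-- max bytes of content downloaded per page -->
          <value>65536</value>
        </property>
        <property>
          <name>http.redirect.max</name>
          <!-- 0 = record redirects for a later fetch round instead of
               following them immediately; > 0 follows them right away -->
          <value>0</value>
        </property>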

Dennis

Best regards


--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:

From: Dennis Kubes <[email protected]>
Subject: Updated Domain URLFilter
To: [email protected]
Date: Saturday, December 13, 2008, 8:57 AM
An updated patch has been added for the domain urlfilter. It now includes matching against the domain suffix, domain name, and hostname, in that order.

Dennis

