ML mail wrote:
Dear Dennis
Just to let you know that I have now had time to test Nutch 1.0-dev with the Domain
URLFilter patch. In order to index only domains with the .be TLD, I added "be" to the
domain-urlfilter.txt file in Nutch's conf directory. I did some test crawls of up to around
400'000 pages, and unfortunately I keep seeing some .com domains, for example www.adobe.com,
which has nothing to do with ".be". If I do a search for .com, there are around 40'000 pages
that end with .com. So is there maybe some extra configuration I need to do in order to get
only .be websites indexed?
Weird, I added a unit test similar to what you described and the URL was
successfully excluded. I used a domain file like this:
net
apache.org
be
www.yahoo.com
And unit tests like this:
assertNotNull(domainFilter.filter("http://lucene.apache.org"));
assertNotNull(domainFilter.filter("http://hadoop.apache.org"));
assertNotNull(domainFilter.filter("http://www.apache.org"));
assertNull(domainFilter.filter("http://www.google.com"));
assertNull(domainFilter.filter("http://mail.yahoo.com"));
assertNotNull(domainFilter.filter("http://www.foobar.net"));
assertNotNull(domainFilter.filter("http://www.foobas.net"));
assertNotNull(domainFilter.filter("http://www.yahoo.com"));
assertNotNull(domainFilter.filter("http://www.foobar.be"));
assertNull(domainFilter.filter("http://www.adobe.com"));
I am going to go ahead and commit this code. If we still see errors
popping up we will need to revisit it. Truthfully, though, I don't know
how an error such as you are describing could happen, as all this filter
does is match the suffix, domain name, and hostname. There is no regex
matching involved.
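As a sketch of that matching logic (an illustrative reimplementation, not the committed DomainURLFilter source; the class and method names here are hypothetical, and real suffix handling in Nutch is more involved than taking the last host label):

```java
import java.net.URL;
import java.util.Set;

// Hypothetical sketch: returns the URL if its suffix, domain name, or
// hostname appears in the entries loaded from domain-urlfilter.txt,
// otherwise null (the URL is filtered out). Multi-part suffixes such
// as "co.uk" are ignored in this simplified version.
public class DomainMatchSketch {
  private final Set<String> domains;

  public DomainMatchSketch(Set<String> domains) {
    this.domains = domains;
  }

  public String filter(String url) {
    try {
      String host = new URL(url).getHost().toLowerCase(); // e.g. "lucene.apache.org"
      String[] parts = host.split("\\.");
      String suffix = parts[parts.length - 1];            // "org"
      String domain = parts.length >= 2
          ? parts[parts.length - 2] + "." + suffix        // "apache.org"
          : host;
      if (domains.contains(suffix)
          || domains.contains(domain)
          || domains.contains(host)) {
        return url;
      }
      return null;
    } catch (Exception e) {
      return null; // malformed URLs are filtered out
    }
  }
}
```

With a domain file containing only "be", www.adobe.com matches none of the three checks, so it should be excluded, which is why the reported behavior is surprising.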
Also, another question: I have noticed that this crawl of around 400'000 pages
currently occupies 57 GB of space (with the segments directory taking up most of
it). But our old Nutch 0.9 crawl of 1'000'000 pages occupies only 13 GB.
So I was wondering what difference between Nutch 0.9 and Nutch 1.0-dev
explains this large difference in space usage?
I know there are changes in the CrawlDb structure and metadata. I don't
know why there would be that big a difference unless:
1) you set the max content fetched higher,
2) you were fetching a set of pages that simply had more content, or
3) you set redirects to > 0, so redirects were followed and you were
actually fetching many more pages.
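To narrow down where the space is going, a small helper like the following can compare the on-disk sizes of the crawl directories (a hedged sketch; the paths in main are examples and should be replaced with your actual crawl directory layout):

```java
import java.io.File;

// Hypothetical diagnostic helper: recursively sums file sizes so you can
// see whether segments/ really dominates the 57 GB, or whether crawldb/
// or linkdb/ also grew between 0.9 and 1.0-dev.
public class DirSize {
  public static long sizeOf(File f) {
    if (f.isFile()) return f.length();
    long total = 0;
    File[] children = f.listFiles(); // null if f does not exist or is unreadable
    if (children != null) {
      for (File c : children) total += sizeOf(c);
    }
    return total;
  }

  public static void main(String[] args) {
    // Example paths only; point these at your crawl directory.
    for (String path : new String[] {"crawl/segments", "crawl/crawldb", "crawl/linkdb"}) {
      System.out.printf("%s: %.1f MB%n", path,
          sizeOf(new File(path)) / (1024.0 * 1024.0));
    }
  }
}
```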
Dennis
Best regards
--- On Sat, 12/13/08, Dennis Kubes <[email protected]> wrote:
From: Dennis Kubes <[email protected]>
Subject: Updated Domain URLFilter
To: [email protected]
Date: Saturday, December 13, 2008, 8:57 AM
An updated patch has been added for the domain urlfilter.
This now includes the matching against domain suffix, domain
name, and hostname in that order.
Dennis