For a large whitelist filtered by hostname, you should use
PrefixURLFilter (built into 0.7).
If you wanted to apply regex rules to the paths of these sites, you
could use my WhitelistURLFilter
(http://issues.apache.org/jira/browse/NUTCH-87). But it sounds like you
don't quite need that.
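To make the prefix idea concrete, here is a minimal sketch in Python. It is
not Nutch code, and the names PrefixWhitelist and host_prefixes are
hypothetical. It assumes plain http URLs, turns each whitelisted hostname
into a URL prefix, and accepts a URL only if it starts with one of them.
Lookup cost depends on the number of distinct prefix lengths, not on the
size of the whitelist.

```python
class PrefixWhitelist:
    """Accept a URL only if it starts with one of the whitelisted prefixes."""

    def __init__(self, prefixes):
        self.prefixes = set(prefixes)
        # Only prefix lengths that actually occur need to be tried.
        self.lengths = sorted({len(p) for p in self.prefixes})

    def accepts(self, url):
        for n in self.lengths:
            if n > len(url):
                break
            if url[:n] in self.prefixes:
                return True
        return False


def host_prefixes(hosts):
    """Turn bare hostnames into URL prefixes (assumption: plain http only)."""
    # The trailing slash keeps "example.ca" from matching "example.ca.evil.com".
    return ["http://%s/" % h.strip().lower() for h in hosts if h.strip()]


# Hypothetical hostnames, for illustration only.
whitelist = PrefixWhitelist(host_prefixes(["www.example.ca", "shop.example.com"]))
print(whitelist.accepts("http://www.example.ca/about.html"))
print(whitelist.accepts("http://www.example.ca.evil.com/"))
```

Writing the output of host_prefixes one per line gives a prefix list of the
kind a prefix-based filter reads.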
Cheers,
--Matt
On Mar 15, 2006, at 2:50 PM, Insurance Squared Inc. wrote:
Hi All,
We're merrily proceeding down our route of a country-specific
search engine, and Nutch seems to be working well. However, we're
finding some sites creeping in that aren't from our country.
Specifically, we automatically allow in sites that are hosted
within the country. We're finding more sites than we'd like hosted
here that are actually owned/operated in another country and thus
not relevant. I'd like to get rid of these if I can.
Is there a viable way of running Nutch 0.7 with only a whitelist of
sites - and a very large whitelist at that (say 500K to a million+
sites, all in one whitelist)? If not, is it possible in Nutch
0.8? That way I can just find other ways of adding known-to-be-good
sites to the whitelist over time.
(FWIW, we automatically allow our specific country TLD; then
for .com/.net/.org we only allow a site if it is physically hosted
here, which we check against an IP list. If other country search engine
folks have comments on a better way to do this, I'd welcome the input.)
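The rule described above could be sketched roughly as follows. This is an
illustration in Python, not the code actually in use: the function name
allowed, the "ca" TLD, and the network ranges are all placeholders (the
ranges are reserved documentation networks), and the site's IP address is
assumed to have been resolved elsewhere.

```python
import ipaddress
from urllib.parse import urlsplit

COUNTRY_TLD = "ca"                     # placeholder country TLD
GENERIC_TLDS = {"com", "net", "org"}
# Placeholder ranges standing in for the real in-country IP list.
IN_COUNTRY_NETS = [ipaddress.ip_network(n)
                   for n in ("192.0.2.0/24", "198.51.100.0/24")]


def allowed(url, resolved_ip):
    """Apply the TLD rule first, then the hosting-IP rule for generic TLDs."""
    host = (urlsplit(url).hostname or "").lower()
    tld = host.rsplit(".", 1)[-1]
    if tld == COUNTRY_TLD:
        return True                    # country TLD: always allowed
    if tld in GENERIC_TLDS:
        ip = ipaddress.ip_address(resolved_ip)
        return any(ip in net for net in IN_COUNTRY_NETS)
    return False                       # any other TLD is rejected
```

A whitelist would sit in front of a check like this, so that sites hosted
here but operated elsewhere are dropped unless explicitly listed.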
--
Matt Kangas / [EMAIL PROTECTED]