There's also the DBUrlFilter, which stores the whitelist in the database: http://issues.apache.org/jira/browse/NUTCH-100
It performs better than the PrefixURLFilter and also makes managing the list easier.
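To illustrate the general idea (a rough sketch only, not the actual NUTCH-100 code; the table and column names below are made up), such a filter loads the allowed hosts once and then accepts or rejects each URL by hostname:

import java.net.URL;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

// Sketch only: a real Nutch plugin would implement
// org.apache.nutch.net.URLFilter, whose filter(String url) method
// returns the URL to accept it or null to reject it.
public class DbWhitelistFilterSketch {

  private final Set<String> allowedHosts = new HashSet<String>();

  // Hypothetical schema: table "whitelist_hosts" with a "host" column.
  public DbWhitelistFilterSketch(String jdbcUrl) throws Exception {
    Connection conn = DriverManager.getConnection(jdbcUrl);
    try {
      Statement st = conn.createStatement();
      ResultSet rs = st.executeQuery("SELECT host FROM whitelist_hosts");
      while (rs.next()) {
        allowedHosts.add(rs.getString(1).toLowerCase());
      }
    } finally {
      conn.close();
    }
  }

  // Accept the URL only if its hostname is on the whitelist.
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      return allowedHosts.contains(host) ? urlString : null;
    } catch (Exception e) {
      return null; // malformed URL: reject
    }
  }
}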
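For comparison, the PrefixURLFilter that Matt mentions below is driven by a plain text file of accepted URL prefixes, one per line. As far as I know the plugin is urlfilter-prefix and it reads prefix-urlfilter.txt from the conf directory once the plugin is listed in plugin.includes (the exact names may differ in 0.7). The file looks roughly like this, with made-up hosts:

http://www.site-one.example/
http://www.site-two.example/
http://site-three.example/

A flat file of 500K to a million prefixes is workable, but every change means regenerating and redeploying the file, which is where the database-backed filter is easier to manage.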
Rgrds,
Thomas

On 3/15/06, Matt Kangas <[EMAIL PROTECTED]> wrote:
>
> For a large whitelist filtered by hostname, you should use
> PrefixURLFilter (built into 0.7).
>
> If you wanted to apply regex rules to the paths of these sites, you
> could use my WhitelistURLFilter
> (http://issues.apache.org/jira/browse/NUTCH-87). But it sounds like
> you don't quite need that.
>
> Cheers,
> --Matt
>
> On Mar 15, 2006, at 2:50 PM, Insurance Squared Inc. wrote:
>
> > Hi All,
> >
> > We're merrily proceeding down our route of a country-specific
> > search engine; Nutch seems to be working well. However, we're
> > finding some sites creeping in that aren't from our country.
> > Specifically, we automatically allow in sites that are hosted
> > within the country. We're finding more sites than we'd like hosted
> > here that are actually owned/operated in another country and thus
> > not relevant. I'd like to get rid of these if I can.
> >
> > Is there a viable way of using Nutch 0.7 with only a whitelist of
> > sites - and a very large whitelist at that (say 500K to a million+
> > sites, all in one whitelist)? If not, is it possible in Nutch
> > 0.8? That way I can just find other ways of adding known-to-be-good
> > sites into the whitelist over time.
> >
> > (FWIW, we automatically allow our specific country TLD, then
> > for .com/.net/.org we only allow a site if it is physically hosted
> > here, checked against an IP list. If other country-search-engine
> > folks have comments on a better way to do this, I'd welcome the
> > input.)
>
> --
> Matt Kangas / [EMAIL PROTECTED]
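On the IP check described at the end of the quoted message (allow the country TLD outright, and allow .com/.net/.org only when the site is hosted in-country): a very rough sketch of that logic is below, with a placeholder TLD and made-up address ranges standing in for a real in-country IP list.

import java.net.InetAddress;

// Rough sketch of the "allow our TLD, otherwise require in-country
// hosting" check from the original question. The TLD and the CIDR
// ranges below are placeholders, not real data.
public class CountryHostCheckSketch {

  private static final String COUNTRY_TLD = ".xx"; // placeholder TLD

  // Placeholder CIDR blocks standing in for an in-country IP list.
  private static final String[] COUNTRY_CIDRS = {
    "192.0.2.0/24", "198.51.100.0/24"
  };

  public static boolean isAllowed(String host) throws Exception {
    String h = host.toLowerCase();
    if (h.endsWith(COUNTRY_TLD)) {
      return true; // country TLD is always allowed
    }
    if (h.endsWith(".com") || h.endsWith(".net") || h.endsWith(".org")) {
      long ip = toLong(InetAddress.getByName(h).getAddress());
      for (String cidr : COUNTRY_CIDRS) {
        if (inCidr(ip, cidr)) {
          return true; // hosted on an in-country address
        }
      }
    }
    return false;
  }

  // Check an IPv4 address (as a long) against a CIDR block.
  private static boolean inCidr(long ip, String cidr) {
    String[] parts = cidr.split("/");
    long base = 0;
    for (String octet : parts[0].split("\\.")) {
      base = (base << 8) | Integer.parseInt(octet);
    }
    int bits = Integer.parseInt(parts[1]);
    long mask = bits == 0 ? 0 : (0xFFFFFFFFL << (32 - bits)) & 0xFFFFFFFFL;
    return (ip & mask) == (base & mask);
  }

  private static long toLong(byte[] addr) {
    long value = 0;
    for (byte b : addr) {
      value = (value << 8) | (b & 0xFF);
    }
    return value;
  }
}

For what it's worth, a maintained IP allocation or GeoIP database would likely be more reliable than a hand-kept range list, since hosting addresses move over time.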