There's the DBUrlFilter as well, which stores the whitelist in the database:
http://issues.apache.org/jira/browse/NUTCH-100

It performs better than the PrefixURLFilter and also makes managing the
list easier.

Rgrds, Thomas
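
(A minimal sketch of the idea, not the actual NUTCH-100 code: load the
whitelisted hosts from a database table into memory once, then accept only
URLs whose host is on the list.  In a real plugin this would sit behind
Nutch's URLFilter extension point; the JDBC URL, table and column names
below are made up.)

    import java.net.URL;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashSet;
    import java.util.Set;

    public class DbWhitelistFilter {

      private final Set<String> allowedHosts = new HashSet<String>();

      // Load the whitelist once at startup; 500K-1M hostnames fit in memory.
      public DbWhitelistFilter(String jdbcUrl) throws Exception {
        Connection conn = DriverManager.getConnection(jdbcUrl);
        try {
          Statement st = conn.createStatement();
          ResultSet rs = st.executeQuery("SELECT host FROM whitelist");
          while (rs.next()) {
            allowedHosts.add(rs.getString(1).toLowerCase());
          }
        } finally {
          conn.close();
        }
      }

      // Same contract as URLFilter.filter(): return the URL to keep it,
      // or null to drop it.
      public String filter(String urlString) {
        try {
          String host = new URL(urlString).getHost().toLowerCase();
          return allowedHosts.contains(host) ? urlString : null;
        } catch (Exception e) {
          return null; // malformed URL: reject
        }
      }
    }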

On 3/15/06, Matt Kangas <[EMAIL PROTECTED]> wrote:
>
> For a large whitelist filtered by hostname, you should use
> PrefixURLFilter (built into 0.7).
>
> If you wanted to apply regex rules to the paths of these sites, you
> could use my WhitelistURLFilter
> (http://issues.apache.org/jira/browse/NUTCH-87). But it sounds like you
> don't quite need that.
>
> Cheers,
> --Matt
>
> On Mar 15, 2006, at 2:50 PM, Insurance Squared Inc. wrote:
>
> > Hi All,
> >
> > We're merrily proceeding down our route of a country-specific
> > search engine, and nutch seems to be working well.  However, we're
> > finding some sites creeping in that aren't from our country.
> > Specifically, we automatically allow sites that are hosted within
> > the country, and we're finding more sites than we'd like that are
> > hosted here but actually owned/operated in another country, and
> > thus not relevant.  I'd like to get rid of these if I can.
> >
> > Is there a viable way of using nutch 0.7 with only a whitelist of
> > sites, and a very large whitelist at that (say 500K to a million+
> > sites, all in one whitelist)?  If not, is it possible in nutch 0.8?
> > That way I can just find other ways of adding known-to-be-good
> > sites to the whitelist over time.
> >
> > (fwiw, we automatically allow our specific country TLD, then
> > for .com/.net/.org we only allow a site if it is physically hosted
> > here, which we check against an IP list.  If other country-specific
> > search engine folks have comments on a better way to do this, I'd
> > welcome the input.)
>
> --
> Matt Kangas / [EMAIL PROTECTED]
>
>
>
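
For reference, wiring up the PrefixURLFilter that Matt suggests usually
comes down to two things; the property and plugin names below are from
memory, so check nutch-default.xml on your version.  First, include the
plugin and point it at your whitelist file in conf/nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-prefix|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>

    <property>
      <name>urlfilter.prefix.file</name>
      <value>prefix-urlfilter.txt</value>
    </property>

Then conf/prefix-urlfilter.txt lists one allowed URL prefix per line, e.g.
the hostnames of your whitelisted sites:

    http://www.site-one.example/
    http://www.site-two.example/

As far as I recall, the prefixes are matched against a trie rather than
evaluated as regexes, which is what lets a list of this size stay fast.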

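And a rough sketch of the TLD-plus-hosting check described in the original
question, in case it helps the discussion; the ccTLD, the address range and
the helper are placeholders, and in practice the IP test would consult a
registry/GeoIP dump of in-country netblocks:

    import java.net.InetAddress;
    import java.net.URL;

    public class CountryCheck {

      private static final String COUNTRY_TLD = ".xx"; // placeholder ccTLD

      // Placeholder: really a lookup against a list of in-country IP ranges.
      static boolean isHostedInCountry(InetAddress addr) {
        return addr.getHostAddress().startsWith("192.0.2."); // dummy range
      }

      static boolean allow(String urlString) throws Exception {
        String host = new URL(urlString).getHost().toLowerCase();
        if (host.endsWith(COUNTRY_TLD)) {
          return true;                     // country TLD: always allowed
        }
        if (host.endsWith(".com") || host.endsWith(".net") || host.endsWith(".org")) {
          // .com/.net/.org: allow only if physically hosted in-country
          return isHostedInCountry(InetAddress.getByName(host));
        }
        return false;                      // everything else rejected
      }
    }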