I agree with you. That was a bold statement, and I have no hard evidence
to back it up.

The DBUrlFilter can be adapted, though, so that it loads all domains in the
database into the cache only once. On a cache miss, the plugin then no longer
goes to the database but simply rejects the URL. The only thing to watch is
that the cache must be big enough to hold all domains in the database.
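
A minimal sketch of that idea, not the actual code from NUTCH-100. The class
name PreloadedDomainFilter and the constructor argument are hypothetical; the
collection stands in for the result of one database query at startup. The
filter(String) method returning the URL on acceptance and null on rejection
is assumed here as the filter convention.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical class for illustration; not the DBUrlFilter plugin itself.
class PreloadedDomainFilter {
    // Holds every whitelisted domain, so it must fit in memory.
    private final Set<String> domains = new HashSet<>();

    // 'allDomains' stands in for a single "load everything" database query.
    PreloadedDomainFilter(Collection<String> allDomains) {
        for (String d : allDomains) {
            domains.add(d.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the URL if its host is in the preloaded set, otherwise null.
    // A miss is rejected immediately; there is no fallback database lookup.
    String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        if (host == null) {
            return null;
        }
        return domains.contains(host.toLowerCase(Locale.ROOT)) ? url : null;
    }
}
```

With this shape, hits and misses both cost one in-memory hash lookup, which
is why the set has to hold the whole whitelist.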

In this case the DBUrlFilter performs better than it does with a database
lookup on every miss, but I have no measurements comparing it with the
PrefixURLFilter.

Rgrds, Thomas

On 3/19/06, Matt Kangas <[EMAIL PROTECTED]> wrote:
>
> I'm curious how this "performs better than PrefixURLFilter".
> Management, yes, but performance? According to the description on
> NUTCH-100, you go to the database for every cache miss. This implies
> that filter hits are cheap, whereas misses are expensive. (tcp/ip
> roundtrip, etc)
>
> Can you please explain?
>
> --Matt
>
> On Mar 19, 2006, at 3:13 AM, TDLN wrote:
>
> > There's the DBUrlFilter as well, that stores the Whitelist in the
> > database:
> > http://issues.apache.org/jira/browse/NUTCH-100
> >
> > It performs better than the PrefixURLFilter and also makes the
> > management of the list easier.
> >
> > Rgrds, Thomas
> >
> > On 3/15/06, Matt Kangas <[EMAIL PROTECTED]> wrote:
> >>
> >> For a large whitelist filtered by hostname, you should use
> >> PrefixURLFilter. (built-in to 0.7)
> >>
> >> If you wanted to apply regex rules to the paths of these sites, you
> >> could use my WhitelistURLFilter
> >> (http://issues.apache.org/jira/browse/NUTCH-87). But it sounds like
> >> you don't quite need that.
> >>
> >> Cheers,
> >> --Matt
> >>
> >> On Mar 15, 2006, at 2:50 PM, Insurance Squared Inc. wrote:
> >>
> >>> Hi All,
> >>>
> >>> We're merrily proceeding down our route of a country specific
> >>> search engine, nutch seems to be working well.  However we're
> >>> finding some sites creeping in that aren't from our country.
> >>> Specifically, we automatically allow in sites that are hosted
> >>> within the country.  We're finding more sites than we'd like hosted
> >>> here that are actually owned/operated in another country and thus
> >>> not relevant.  I'd like to get rid of these if I can.
> >>>
> >>> Is there a viable way of using nutch 0.7 using only a whitelist of
> >>> sites - and a very large whitelist at that (say 500K to a million+
> >>> sites, all in one whitelist)?  If not, is it possible in nutch
> >>> 0.8?  That way I can just find other ways of adding known-to-be-
> >>> good sites into the white list over time.
> >>>
> >>> (fwiw, we automatically allow our specific country TLD, then
> >>> for .com/.net/.org we only allow if the site is physically hosted
> >>> here by checking an IP list.  If other country search engine folks
> >>> have comments on a better way to do this I'd welcome the input.).
> >>
> >> --
> >> Matt Kangas / [EMAIL PROTECTED]
> >>
> >>
> >>
>
> --
> Matt Kangas / [EMAIL PROTECTED]
>
>
>
