Aha, I see the misunderstanding. PrefixURLFilter uses a class called TrieStringMatcher. A "trie" is a data structure that stores an associative array with string-valued keys. Keys are stored in a decomposed form: an ordered tree of the key string's characters.

http://en.wikipedia.org/wiki/Trie

The actual class used is:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/util/TrieStringMatcher.html

For a URL of length K and N patterns to test against, at most K character tests are performed, regardless of N. Performance should be similar to a HashMap: not exactly O(1), but close. With 500k domains to match against, PrefixURLFilter should still be reasonably fast.

This is definitely not the case for RegexURLFilter, 'tho. ;)
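To make the trie's bounded cost concrete, here is a minimal prefix-trie sketch in Java. It is illustrative only, not the actual TrieStringMatcher code, but the matching loop shows why the work is bounded by the URL's length rather than by the number of stored prefixes:

    import java.util.HashMap;
    import java.util.Map;

    class PrefixTrie {
      private static class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        boolean terminal; // a stored prefix ends at this node
      }

      private final Node root = new Node();

      void add(String prefix) {
        Node n = root;
        for (int i = 0; i < prefix.length(); i++) {
          Node child = n.children.get(prefix.charAt(i));
          if (child == null) {
            child = new Node();
            n.children.put(prefix.charAt(i), child);
          }
          n = child;
        }
        n.terminal = true;
      }

      // True if any stored prefix is a prefix of url. Takes at most
      // url.length() steps, no matter how many prefixes are stored.
      boolean matches(String url) {
        Node n = root;
        for (int i = 0; i < url.length(); i++) {
          if (n.terminal) return true;
          n = n.children.get(url.charAt(i));
          if (n == null) return false;
        }
        return n.terminal;
      }
    }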


On Mar 20, 2006, at 3:41 AM, TDLN wrote:

Yes, a HashMap underlies the cache.

Isn't the main difference that, in the case of PrefixURLFilter, the URL to be tested is matched against every single URL pattern in the regex-urlfilter file (500k patterns in this case)? Only with the cache route is it a lookup of a single entry: if the entry is found, the URL passes. I can imagine this is actually much faster.

In my particular scenario, I am also storing a "category" with every permitted domain in the database. The category is stored as the value in the HashMap and used to add a category field to the index.
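Roughly like this (class and method names are made up for illustration; the real plugin differs):

    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    class DomainCache {
      // host -> category, loaded from the database
      private final Map<String, String> allowed = new HashMap<String, String>();

      void put(String host, String category) { allowed.put(host, category); }

      // One HashMap lookup per URL, regardless of how many domains are
      // whitelisted. Returns the category if the host is permitted,
      // or null to signal rejection.
      String categoryFor(String urlString) {
        try {
          return allowed.get(new URL(urlString).getHost());
        } catch (Exception e) {
          return null; // malformed URL: reject
        }
      }
    }

The returned category can then be written into an extra field at indexing time.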

Rgrds, Thomas



On 3/19/06, Matt Kangas <[EMAIL PROTECTED]> wrote:

I'm still curious how this compares to PrefixURLFilter. If you go the
"load all domains" route, I don't see why you wouldn't just dump the
DB data into a flat text file and feed this to PrefixURLFilter.
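For example, if I remember the format correctly, the dump could be as simple as one URL prefix per line, which is what PrefixURLFilter reads (lines starting with # are comments; hosts below are placeholders):

    # prefix-urlfilter.txt, generated from the domain table
    http://www.example.com/
    http://example.org/
    http://news.example.net/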

(Also, the trie underlying PrefixURLFilter should consume less RAM
than the hashmap presumably underlying your cache, while still
delivering similar lookup speed. But perhaps I'm wrong?)

--Matt

On Mar 19, 2006, at 1:09 PM, TDLN wrote:

I agree with you. That was a bold statement, not necessarily backed up by any hard evidence that I can provide you with.

The DBUrlFilter can be adapted, though, so that it loads all domains in the database into the cache only once. On a cache miss, the plugin no longer goes to the database; it simply rejects the URL. The only thing to think about is making the cache big enough to hold all domains in the database.
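A sketch of that one-time load, assuming a simple domain table (class, table, and column names are hypothetical):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Map;

    class CacheLoader {
      // Runs once at startup; afterwards every filter call is a
      // pure in-memory lookup.
      static void loadAll(Connection conn, Map<String, String> cache)
          throws Exception {
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT host, category FROM domains");
        while (rs.next()) {
          cache.put(rs.getString("host"), rs.getString("category"));
        }
        rs.close();
        st.close();
      }
    }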

In this case the DBUrlFilter performs better, but I have no comparison with the PrefixURLFilter.

Rgrds, Thomas




On 3/19/06, Matt Kangas <[EMAIL PROTECTED]> wrote:

I'm curious how this "performs better than PrefixURLFilter". Management, yes, but performance? According to the description on NUTCH-100, you go to the database for every cache miss. This implies that filter hits are cheap, whereas misses are expensive (TCP/IP round trip, etc.).

Can you please explain?

--Matt

On Mar 19, 2006, at 3:13 AM, TDLN wrote:

There's the DBUrlFilter as well, that stores the Whitelist in the
database:
http://issues.apache.org/jira/browse/NUTCH-100

It performs better than the PrefixURLFilter and also makes the list easier to manage.

Rgrds, Thomas

On 3/15/06, Matt Kangas <[EMAIL PROTECTED]> wrote:

For a large whitelist filtered by hostname, you should use
PrefixURLFilter. (built-in to 0.7)

If you wanted to apply regex rules to the paths of these sites, you
could use my WhitelistURLFilter (http://issues.apache.org/jira/browse/NUTCH-87). But it sounds like you don't quite need that.
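Enabling the prefix filter is just configuration; something along these lines in conf/nutch-site.xml (double-check the exact property names and plugin list against nutch-default.xml in your version):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-prefix|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>

    <property>
      <name>urlfilter.prefix.file</name>
      <value>prefix-urlfilter.txt</value>
    </property>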

Cheers,
--Matt

On Mar 15, 2006, at 2:50 PM, Insurance Squared Inc. wrote:

Hi All,

We're merrily proceeding down our route of a country-specific search engine, and Nutch seems to be working well. However, we're finding some sites creeping in that aren't from our country. Specifically, we automatically allow in sites that are hosted within the country, and we're finding more sites than we'd like hosted here that are actually owned/operated in another country and thus not relevant. I'd like to get rid of these if I can.

Is there a viable way of running Nutch 0.7 with only a whitelist of sites, and a very large whitelist at that (say 500K to a million+ sites, all in one whitelist)? If not, is it possible in Nutch 0.8? That way I can just find other ways of adding known-to-be-good sites into the whitelist over time.

(FWIW, we automatically allow our specific country TLD; then for .com/.net/.org we only allow a site if it is physically hosted here, which we check against an IP list. If other country-search-engine folks have comments on a better way to do this, I'd welcome the input.)
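Roughly, that check looks like the sketch below (illustrative only; the TLD constant and the IP-range lookup are placeholders for our real list, and since every lookup costs a DNS round trip the results are worth caching):

    import java.net.InetAddress;
    import java.net.URL;

    class CountryFilter {
      private static final String COUNTRY_TLD = ".example"; // placeholder TLD

      boolean allow(String urlString) {
        try {
          String host = new URL(urlString).getHost();
          if (host.endsWith(COUNTRY_TLD)) return true; // country TLD: always allowed
          if (host.endsWith(".com") || host.endsWith(".net")
              || host.endsWith(".org")) {
            String ip = InetAddress.getByName(host).getHostAddress();
            return isHostedInCountry(ip); // consult the local IP block list
          }
          return false;
        } catch (Exception e) {
          return false; // unresolvable or malformed: reject
        }
      }

      // Placeholder: look the address up in an in-country IP range list.
      private boolean isHostedInCountry(String ip) { return false; }
    }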



--
Matt Kangas / [EMAIL PROTECTED]




--
Matt Kangas / [EMAIL PROTECTED]

