Re: [Nutch-dev] PrefixURLFilter - filtering URLs by prefix

Doug Cutting Thu, 20 May 2004 12:28:35 -0700

Thanks!  This looks very useful!  I've added it into CVS.

Doug

Geir Ove Grønmo wrote:

Hi!

We've used Nutch as part of a project where we've got a link database
containing about 5000 unique URLs. The project needed to crawl all
those links and all pages stored below them.

To do this I tried to use the RegexURLFilter, but unfortunately it does
not scale when you have lots of different links. In this case I ended up
with 5000 regular expressions that had to be evaluated for each new link
found.

I therefore wrote a new URLFilter that did prefix matching, something
that works nicely for the above task. In CVS I found the
PrefixStringMatcher class which was perfect for implementing the
PrefixURLFilter class. The performance of the trie string matcher is
very good compared to regexp evaluation and the memory usage is also
minimal.
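
To illustrate the idea, here is a minimal sketch of trie-based prefix
matching in Java. It is not the attached PrefixURLFilter nor the
PrefixStringMatcher from CVS; the class name PrefixFilterSketch, the
filter method, and the convention of returning null for a rejected URL
are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not the Nutch PrefixURLFilter or PrefixStringMatcher source.
class PrefixFilterSketch {

    // One trie node: children keyed by character, plus an end-of-prefix flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    // Build the trie once from the list of accepted URL prefixes.
    PrefixFilterSketch(List<String> prefixes) {
        for (String prefix : prefixes) {
            Node node = root;
            for (int i = 0; i < prefix.length(); i++) {
                node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
            }
            node.terminal = true;
        }
    }

    // Returns the URL if some stored prefix matches its start, otherwise null
    // (the null-on-reject convention is an assumption of this sketch).
    String filter(String url) {
        Node node = root;
        if (node.terminal) {
            return url; // an empty prefix accepts everything
        }
        for (int i = 0; i < url.length(); i++) {
            node = node.children.get(url.charAt(i));
            if (node == null) {
                return null; // no stored prefix continues with this character
            }
            if (node.terminal) {
                return url; // reached the end of a stored prefix
            }
        }
        return null; // URL is shorter than every prefix it partially matches
    }
}
```

Each lookup walks at most as many nodes as the URL has characters, so
the cost per link does not grow with the number of prefixes, whereas a
list of 5000 regular expressions has to be tried one after another.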

You'll find the source code attached to this posting. Feel free to apply
changes and also include it in CVS if you'd like.

All the best,
Geir O.

