Hi!

We've used Nutch as part of a project where we have a link database
containing about 5000 unique URLs. The project needed to crawl all
those links and all pages stored below them.

To do this I tried to use the RegexURLFilter, but unfortunately it does
not scale when you have lots of different links. With one expression per
URL I ended up with 5000 regular expressions, all of which might have to
be evaluated for each new link found.

I therefore wrote a new URLFilter that did prefix matching, something
that works nicely for the above task. In CVS I found the
PrefixStringMatcher class which was perfect for implementing the
PrefixURLFilter class. The performance of the trie string matcher is
very good compared to regexp evaluation and the memory usage is also
minimal.
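
For anyone who wants the idea without opening the attachment, here is a
minimal sketch of prefix matching with a trie. It is not the attached
PrefixURLFilter and it does not use PrefixStringMatcher; the class name
PrefixFilterSketch and the example URLs are made up for illustration, and
I'm assuming the convention that filter() returns the URL when it is
accepted and null when it is rejected.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, self-contained sketch of a prefix-based URL filter.
public class PrefixFilterSketch {

    // One trie node per character; 'terminal' marks the end of a stored prefix.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    // Build the trie once from all allowed prefixes.
    public PrefixFilterSketch(List<String> prefixes) {
        for (String prefix : prefixes) {
            Node node = root;
            for (int i = 0; i < prefix.length(); i++) {
                node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
            }
            node.terminal = true;
        }
    }

    // Assumed convention: return the URL if some stored prefix matches, else null.
    // The walk touches at most url.length() nodes, no matter how many prefixes
    // are stored.
    public String filter(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            if (node.terminal) {
                return url; // a complete prefix has been consumed
            }
            node = node.children.get(url.charAt(i));
            if (node == null) {
                return null; // no stored prefix continues with this character
            }
        }
        return node.terminal ? url : null; // URL is exactly equal to a prefix
    }
}
```

The point of the design is that the lookup cost depends on the length of
the URL being tested rather than on the number of prefixes, which is what
makes it practical with thousands of entries.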

You'll find the source code attached to this posting. Feel free to apply
changes and also include it in CVS if you'd like.

All the best,
Geir O.

Attachment: PrefixURLFilter.java