Hi! We've used Nutch as part of a project where we've got a link database containing about 5000 unique URLs. The project needed to crawl all those links and all pages stored below them.
To do this I first tried the RegexURLFilter, but unfortunately it does not scale to that many distinct links: I ended up with 5000 regular expressions that had to be evaluated against every new link found, and that clearly did not scale. I therefore wrote a new URLFilter that does prefix matching instead, which works nicely for the task above. In CVS I found the PrefixStringMatcher class, which was perfect for implementing the PrefixURLFilter class. The performance of the trie-based string matcher is very good compared to regexp evaluation, and its memory usage is minimal as well.

You'll find the source code attached to this posting. Feel free to apply changes, and to include it in CVS if you'd like.

All the best,
Geir O.
PrefixURLFilter.java
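For readers without access to the attachment, here is a rough sketch of the trie-based prefix-matching idea the post describes. All class and method names here are illustrative, not the actual attached code or Nutch's PrefixStringMatcher API: the point is that lookup cost depends on the length of the URL being tested, not on the number of stored prefixes, unlike evaluating thousands of regexes per link.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative trie-based prefix matcher (not the attached PrefixURLFilter).
public class PrefixMatcherSketch {

    // One node per character; terminal marks the end of a stored prefix.
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal = false;
    }

    private final Node root = new Node();

    // Store an allowed URL prefix in the trie.
    public void addPrefix(String prefix) {
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.terminal = true;
    }

    // True if the URL starts with any stored prefix. The walk stops at the
    // shortest matching prefix, so cost is independent of how many
    // prefixes are stored.
    public boolean matches(String url) {
        Node cur = root;
        for (char c : url.toCharArray()) {
            if (cur.terminal) {
                return true;
            }
            cur = cur.children.get(c);
            if (cur == null) {
                return false;
            }
        }
        return cur.terminal;
    }

    public static void main(String[] args) {
        PrefixMatcherSketch m = new PrefixMatcherSketch();
        m.addPrefix("http://example.org/docs/");
        m.addPrefix("http://example.com/");
        System.out.println(m.matches("http://example.org/docs/page.html")); // true
        System.out.println(m.matches("http://other.org/index.html"));       // false
    }
}
```

A URLFilter built on this would simply return the URL when `matches` succeeds and null otherwise, so only links under the 5000 stored prefixes survive filtering.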
