Thanks! This looks very useful! I've added it to CVS.
Doug
Geir Ove Grønmo wrote:
Hi!
We've used Nutch as part of a project where we've got a link database containing about 5000 unique URLs. The project needed to crawl all those links and all pages stored below them.
To do this I tried to use the RegexURLFilter, but unfortunately it does not scale when you have lots of different links. In this case I ended up with 5000 regular expressions, all of which had to be evaluated for each new link found.
I therefore wrote a new URLFilter that did prefix matching, something that works nicely for the above task. In CVS I found the PrefixStringMatcher class which was perfect for implementing the PrefixURLFilter class. The performance of the trie string matcher is very good compared to regexp evaluation and the memory usage is also minimal.
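To illustrate the idea, here is a minimal sketch of trie-based prefix matching. It is not the attached source and does not use the PrefixStringMatcher class from CVS; the class name PrefixFilterSketch, the hand-rolled trie, and the choice to return the URL on a match and null otherwise are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: not the attached PrefixURLFilter and not
// Nutch's PrefixStringMatcher. All names here are made up for illustration.
class PrefixFilterSketch {

    // One trie node per character position along a stored prefix.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a configured prefix ends at this node
    }

    private final Node root = new Node();

    PrefixFilterSketch(List<String> prefixes) {
        for (String prefix : prefixes) {
            Node node = root;
            for (int i = 0; i < prefix.length(); i++) {
                node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
            }
            node.terminal = true;
        }
    }

    // Returns the URL if it starts with any configured prefix, otherwise null.
    // (Assumed convention for this sketch.)
    String filter(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            if (node.terminal) {
                return url; // a whole prefix has been consumed
            }
            node = node.children.get(url.charAt(i));
            if (node == null) {
                return null; // no stored prefix continues with this character
            }
        }
        // The URL ran out; accept only if it is exactly a stored prefix.
        return node.terminal ? url : null;
    }

    public static void main(String[] args) {
        PrefixFilterSketch filter = new PrefixFilterSketch(
                Arrays.asList("http://example.org/docs/", "http://example.com/"));
        System.out.println(filter.filter("http://example.org/docs/intro.html"));
        System.out.println(filter.filter("http://example.org/other/"));
    }
}
```

The lookup walks the URL one character at a time, so its cost is bounded by the length of the URL rather than by the number of prefixes, which is why 5000 prefixes are no more expensive to check than a handful.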
You'll find the source code attached to this posting. Feel free to apply changes and also include it in CVS if you'd like.
All the best, Geir O.
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
