I've been watching the discussion of faster regex libraries with much
interest. But if regex speed is the problem, wouldn't using fewer
regexes be a good answer?
Protocol and extension filtering could be done by another URLFilter
plugin that is dedicated to this task, and uses more lightweight
string-chopping techniques. That way full regex support could be
retained for the tasks where it's really needed.
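
To make the idea concrete, here is a rough sketch of what such a filter could look like. It is only an illustration under my own assumptions: the class name SimpleUrlFilter, its constructor arguments, and the "return the URL or null" convention are all hypothetical, and it does not implement the actual Nutch URLFilter plugin interface. It relies on nothing but plain String operations from the JDK.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical standalone sketch, NOT the real Nutch URLFilter plugin API.
// Accepts or rejects a URL by protocol and file extension without any regex.
class SimpleUrlFilter {
    private final Set<String> allowedProtocols;   // lower-case, e.g. "http"
    private final Set<String> blockedExtensions;  // lower-case, no dot, e.g. "gif"

    SimpleUrlFilter(Set<String> allowedProtocols, Set<String> blockedExtensions) {
        this.allowedProtocols = new HashSet<>(allowedProtocols);
        this.blockedExtensions = new HashSet<>(blockedExtensions);
    }

    // Assumed convention: return the URL unchanged if accepted, null if rejected.
    String filter(String url) {
        // The protocol is whatever precedes the first ':' (the role "^http" plays).
        int colon = url.indexOf(':');
        if (colon <= 0) return null;
        String protocol = url.substring(0, colon).toLowerCase(Locale.ROOT);
        if (!allowedProtocols.contains(protocol)) return null;

        // Ignore the query string and fragment when looking for an extension.
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) end = query;
        int fragment = url.indexOf('#');
        if (fragment >= 0 && fragment < end) end = fragment;

        // Skip "//host" so a dot in the host name is not taken for an extension.
        int pathStart = url.startsWith("://", colon) ? url.indexOf('/', colon + 3) : colon + 1;
        if (pathStart < 0 || pathStart >= end) return url; // no path, nothing to check

        String path = url.substring(pathStart, end);
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot > slash && dot < path.length() - 1) {
            String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (blockedExtensions.contains(extension)) return null;
        }
        return url;
    }

    public static void main(String[] args) {
        SimpleUrlFilter f = new SimpleUrlFilter(
                new HashSet<>(Arrays.asList("http", "https")),
                new HashSet<>(Arrays.asList("gif", "jpg", "zip")));
        System.out.println(f.filter("http://lucene.apache.org/nutch"));
        System.out.println(f.filter("http://example.com/logo.GIF"));
    }
}
```

Because the protocol check only looks at the text before the first ':', a URL that merely carries another URL in its query string is judged by its own prefix alone, and the extension check never looks past the '?'.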
On Mar 13, 2006, at 12:31 PM, Howie Wang wrote:
I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $
(which are used in nutch)
Are there other ways to match start/end of string in the other
regex library? I use "^http" a lot because a lot of sites pass around
URLs in the query string, and I don't want them (e.g.
http://del.icio.us/howie?url=http://lucene.apache.org/nutch)
Howie
--
Matt Kangas / [EMAIL PROTECTED]