RegEx Domain/URL matching

mmartinek Sat, 28 May 2011 17:24:05 -0700

Greetings,

I'm new to Nutch and I'm jumping right in to develop some filters.
First, though, I have a couple of questions that I would really
appreciate answers to. Please keep in mind that I have already read the
documentation and reviewed a good portion of the code.


First, are filters loaded once and re-used, or should I expect a new
instance to be created per URL? I'm going to implement a very expensive
loading step, and depending on the answer I'll either cache its result
statically or, if the filter is only instantiated once anyway, not worry
about it at all.
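
To make the "cached statically" option concrete, here is a minimal sketch
of what I have in mind. The class name CachedRuleFilter, the loadRules()
stand-in and the example.com rule are all hypothetical, and the class does
not implement any Nutch interface; it only shows a load that runs once per
JVM (strictly, once per class loader) no matter how many filter instances
are created.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Hypothetical filter class; it does not implement any Nutch interface.
final class CachedRuleFilter {

    // Counts how often the expensive load runs (for demonstration only).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // Initialization-on-demand holder: the JVM initializes Holder once,
    // on first access, and makes that initialization thread-safe.
    private static final class Holder {
        static final List<Pattern> RULES = loadRules();
    }

    // Stand-in for the expensive load (for example, reading a database).
    private static List<Pattern> loadRules() {
        LOAD_COUNT.incrementAndGet();
        List<Pattern> rules = new ArrayList<>();
        rules.add(Pattern.compile("^https?://([a-z0-9-]+\\.)*example\\.com/.*"));
        return Collections.unmodifiableList(rules);
    }

    // Returns true if any cached rule matches the whole URL.
    boolean accepts(String url) {
        for (Pattern p : Holder.RULES) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape it would not matter whether the filter is instantiated
once or once per URL, since every instance shares Holder.RULES.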

Second, and related to my first question, what sort of performance does
the URL matching provide? Are compiled patterns cached? I have upward of
3,374,121 regular expressions for URL and domain filtering, which will
provide an exceedingly fine granularity of acceptance.
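
By "compiled patterns cached" I mean something along these lines: each
expression is compiled once and the resulting java.util.regex.Pattern is
re-used for every URL. This is only a sketch of the idea, not a claim about
how Nutch does it; PatternCache and its methods are names I made up.
Pattern instances are immutable and safe to share between threads, which
is what makes this kind of cache workable.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

// Hypothetical cache mapping regex source text to its compiled Pattern.
final class PatternCache {

    private final ConcurrentMap<String, Pattern> compiled =
            new ConcurrentHashMap<>();

    // Compiles the expression on first use and returns the cached
    // Pattern on every later call with the same source text.
    Pattern get(String regex) {
        return compiled.computeIfAbsent(regex, Pattern::compile);
    }

    // Returns true if the expression matches anywhere in the URL.
    boolean matches(String regex, String url) {
        return get(regex).matcher(url).find();
    }

    // Number of distinct expressions compiled so far.
    int size() {
        return compiled.size();
    }
}
```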

Basically, if the default URL regex filter already caches both the loaded
expressions and their compiled patterns, then I should be able to write a
script that dumps the expressions from the database into the
configuration file instead. Otherwise, I'll implement my own filter that
loads them directly from the database.
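
The dump script would look roughly like the sketch below. It assumes the
one-rule-per-line layout I see in regex-urlfilter.txt, where a leading "+"
accepts and a leading "-" rejects. The Rule class is a hypothetical
stand-in for a database row and the generated comment line is my own
addition; the actual database query is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that turns database rows into filter-file lines.
final class RuleFileWriter {

    // Hypothetical row shape: a regex plus an accept/reject flag.
    static final class Rule {
        final String regex;
        final boolean accept;

        Rule(String regex, boolean accept) {
            this.regex = regex;
            this.accept = accept;
        }
    }

    // Builds one line per rule, prefixed with '+' (accept) or '-' (reject).
    static List<String> toLines(List<Rule> rules) {
        List<String> lines = new ArrayList<>();
        lines.add("# generated from the rule database");
        for (Rule r : rules) {
            lines.add((r.accept ? "+" : "-") + r.regex);
        }
        return lines;
    }

    // Writes the generated lines to the target file as UTF-8,
    // replacing any existing content.
    static void write(Path target, List<Rule> rules) throws IOException {
        Files.write(target, toLines(rules), StandardCharsets.UTF_8);
    }
}
```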

Thanks in advance,
Michael

--
View this message in context: 
http://lucene.472066.n3.nabble.com/RegEx-Domain-URL-matching-tp2997679p2997679.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
