For now I would just create a bare-bones infrastructure to make the existing URL-related filtering/processing pluggable, i.e., (1) define an interface URLFilter and (2) convert RegexURLFilter.java and PrefixURLFilter.java into plugins. After that, people can write more sophisticated plugins as you suggest, either of their own invention or by calling a commercial library/engine.
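A minimal sketch of what such an extension point might look like, assuming a single filter(String) method that returns the URL (possibly rewritten) or null to reject it. The method signature, the Sketch* class names, and the chain class are hypothetical and not existing Nutch code; the sketch is written against modern Java (8+) for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical extension point: return the URL to keep it, or null to reject it. */
interface URLFilter {
    String filter(String url);
}

/** Illustrative prefix filter: keeps only URLs starting with an allowed prefix. */
class SketchPrefixFilter implements URLFilter {
    private final List<String> prefixes;

    SketchPrefixFilter(List<String> prefixes) {
        this.prefixes = prefixes;
    }

    public String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null; // no allowed prefix matched
    }
}

/** Illustrative regex filter: rejects URLs matching any deny pattern. */
class SketchRegexFilter implements URLFilter {
    private final List<Pattern> deny = new ArrayList<>();

    SketchRegexFilter(List<String> denyPatterns) {
        for (String p : denyPatterns) {
            deny.add(Pattern.compile(p));
        }
    }

    public String filter(String url) {
        for (Pattern p : deny) {
            if (p.matcher(url).find()) {
                return null;
            }
        }
        return url;
    }
}

/** Applies filters in list order and stops at the first rejection. */
class URLFilterChain implements URLFilter {
    private final List<? extends URLFilter> filters;

    URLFilterChain(List<? extends URLFilter> filters) {
        this.filters = filters;
    }

    public String filter(String url) {
        for (URLFilter f : filters) {
            url = f.filter(url);
            if (url == null) {
                return null;
            }
        }
        return url;
    }
}

class URLFilterDemo {
    public static void main(String[] args) {
        URLFilter chain = new URLFilterChain(Arrays.asList(
                new SketchPrefixFilter(Arrays.asList("http://")),
                new SketchRegexFilter(Arrays.asList("\\.(gif|jpg)$"))));
        System.out.println(chain.filter("http://example.org/index.html"));
        System.out.println(chain.filter("http://example.org/logo.gif"));
    }
}
```

Here the order of the list passed to the chain is the order in which filters run, which is one way to make the ordering explicit instead of relying on plugin names.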
However, I do not quite follow your discussion about 3xx forwards.

John

On Mon, Jan 31, 2005 at 08:03:03PM -0500, Chirag Chaman wrote:
> John:
>
> This is a very good idea -- and one that we currently use as a "hack" (i.e.
> very slow)
>
> Here are a few things that we faced:
>
> 1. At times we need to reprocess rules. Example:
>    - Run URL filter and remove URL
>    - Run RegexURL filter to transform passed url to another URL
>    - Now, it may be required to run URL filter again
>
> Thus, having a way to reject in RegexURL would be nice. That would
> also make URLFilter redundant.
>
> 2. 3xx forwards -- they seem to get by as the first URL gets recorded.
> There needs to be a way where getting a 3xx forward should junk the old url
> and start taking the new one or both (user defined). Now the resulting URL
> should be checked against filters. Thus the ability to call the plugin from
> protocol-http.
>
> 3. As rules grow, filtering becomes slow -- prior to using Nutch we were
> using a commercial RETE rules engine in which we had loaded the REs as
> rules. This improved speed immensely. Maybe an overkill for now. Below is a
> simpler way to do this.
>
> Here's what we're planning on building -- is this helpful? How would this
> play in with plugins...
>
> <GROUP> Rule Group Name
>   <RULE>
>     <MATCH> RE to match </MATCH>
>     <ACTION> Discard/Substitution/GoTo </ACTION>
>     <SUBSTITUTION> Substitution </SUBSTITUTION>
>     <GOTO>RuleGroupToSendProcess</GOTO>
>     <STOP> 0 or 1 - 0 would mean keep processing more rules </STOP>
>   </RULE>
> </GROUP>
>
> Here's how this would work.
>
> - Each file has a "Default" group, under which all rules are kept.
> - For more advanced rules, one could send control to another RuleGroup on
> match (helpful when you want specific groups of rules for a certain domain,
> extension, etc) -- this will cut down the number of rules to look at.
> - The Stop exits upon a match or keeps processing more rules in the same
> group.
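One possible reading of the rule-group flow quoted above, as a rough Java sketch. The class and method names, the use of null for a discarded URL, and the cap on GOTO jumps are assumptions made for illustration; none of them come from the proposal or from Nutch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical in-memory form of the GROUP/RULE layout. */
class RuleGroups {
    enum Action { DISCARD, SUBSTITUTION, GOTO }

    static final class Rule {
        final Pattern match;    // <MATCH>
        final Action action;    // <ACTION>
        final String argument;  // <SUBSTITUTION> text or <GOTO> group name
        final boolean stop;     // <STOP>: true = 1, false = 0

        Rule(String regex, Action action, String argument, boolean stop) {
            this.match = Pattern.compile(regex);
            this.action = action;
            this.argument = argument;
            this.stop = stop;
        }
    }

    // Assumed guard against GOTO cycles; not part of the proposal.
    private static final int MAX_JUMPS = 16;

    private final Map<String, List<Rule>> groups = new HashMap<>();

    void add(String group, Rule rule) {
        groups.computeIfAbsent(group, k -> new ArrayList<>()).add(rule);
    }

    /** Returns the (possibly rewritten) URL, or null if a rule discards it. */
    String apply(String url) {
        String group = "Default"; // processing always starts in the Default group
        for (int jumps = 0; jumps <= MAX_JUMPS; jumps++) {
            String next = null;
            for (Rule r : groups.getOrDefault(group, Collections.emptyList())) {
                Matcher m = r.match.matcher(url);
                if (!m.find()) {
                    continue;
                }
                if (r.action == Action.DISCARD) {
                    return null;
                }
                if (r.action == Action.GOTO) {
                    next = r.argument; // hand control to another rule group
                    break;
                }
                url = m.replaceAll(r.argument); // SUBSTITUTION
                if (r.stop) {
                    return url;
                }
            }
            if (next == null) {
                return url; // ran out of rules in this group
            }
            group = next;
        }
        throw new IllegalStateException("too many GOTO jumps");
    }
}
```

In this reading, a rule that sends image hosts to a separate group means the Default group's remaining rules are never evaluated for those URLs, which is where the reduction in rules examined would come from.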
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of John X
> Sent: Monday, January 31, 2005 7:53 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: [Nutch-dev] make URLFilter as plugin
>
> Hi, All,
>
> I propose to define a plugin extension point for URLFilter, and convert
> the current RegexURLFilter.java, PrefixURLFilter.java, etc., into plugins.
> However, there is one requirement that differs from other plugin extensions:
> we should be able to specify the order in which plugins are loaded and
> applied. I have not checked, but I assume that, by default, we can always
> name plugins in alphabetical order.
> Stefan: any better way to do this?
>
> If no one thinks this is a bad idea, I am going to start work on it right
> away.
>
> John

__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!
-------------------------------------------------------
This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting
Tool for open source databases. Create drag-&-drop reports. Save time
by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
Download a FREE copy at http://www.intelliview.com/go/osdn_nl
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
