John:

Right now, if protocol-http comes across a 3xx header, it gets the
forwarding URL and its content. It would be nice to take this URL and have
it go through the URL processing.

For example, if one has www.oldjunksite.com being rejected, one could
potentially create www.newjunksite.com, have its pages forwarded to the
old site, and circumvent the filtering.
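
A minimal sketch of the idea, assuming a filter that returns the URL to
accept it and null to reject it. The class and method names are
hypothetical and are not part of protocol-http; no network access is
involved, the status code and Location value are passed in directly.

```java
import java.net.URI;
import java.util.function.UnaryOperator;

/** Hypothetical helper; not part of Nutch or protocol-http. */
final class RedirectFilterSketch {

    /**
     * Returns the URL that should be fetched next, or null if the redirect
     * target is rejected. The filter is assumed to return its input to
     * accept a URL and null to reject it.
     */
    static String nextUrl(String requestedUrl, int status, String location,
                          UnaryOperator<String> filter) {
        boolean isRedirect = status >= 300 && status < 400 && location != null;
        if (!isRedirect) {
            return requestedUrl;
        }
        // Location may be relative, so resolve it against the requested URL.
        String target = URI.create(requestedUrl).resolve(location).toString();
        // Run the redirect target through the same filter as any other URL.
        return filter.apply(target);
    }

    public static void main(String[] args) {
        UnaryOperator<String> rejectOldSite =
            url -> url.contains("www.oldjunksite.com") ? null : url;
        // The forward to the rejected site is caught by the filter.
        System.out.println(nextUrl("http://www.newjunksite.com/a", 301,
            "http://www.oldjunksite.com/a", rejectOldSite));
        // A relative forward on an allowed site passes through.
        System.out.println(nextUrl("http://www.newjunksite.com/a", 302,
            "/b", rejectOldSite));
    }
}
```

With this in place, the forward from www.newjunksite.com to
www.oldjunksite.com is rejected instead of being fetched.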

 

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John X
Sent: Monday, January 31, 2005 9:46 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] make URLFilter as plugin

Currently I would just create a bare-bones infrastructure to make existing
URL-related filtering/processing pluggable, i.e.,
(1) define an interface URLFilter
(2) convert RegexURLFilter.java, PrefixURLFilter.java into plugins.
After that, people can write more sophisticated plugins as you
suggest, either of their own invention or by calling a commercial lib/engine.
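
As a rough sketch of steps (1) and (2): the method signature below is an
assumption (return the URL to keep it, null to reject it), and the two
implementations are simplified stand-ins, not the actual
RegexURLFilter.java or PrefixURLFilter.java.

```java
import java.util.regex.Pattern;

/** Illustrative only; the real extension point may look different. */
final class UrlFilterPluginSketch {

    /** Assumed contract: return the URL to accept it, null to reject it. */
    interface URLFilter {
        String filter(String url);
    }

    /** Accepts a URL if it starts with any of the configured prefixes. */
    static final class PrefixFilter implements URLFilter {
        private final String[] prefixes;

        PrefixFilter(String... prefixes) {
            this.prefixes = prefixes;
        }

        @Override
        public String filter(String url) {
            for (String prefix : prefixes) {
                if (url.startsWith(prefix)) {
                    return url;
                }
            }
            return null;
        }
    }

    /** Rejects a URL if the configured regular expression matches it. */
    static final class RegexFilter implements URLFilter {
        private final Pattern reject;

        RegexFilter(String rejectRegex) {
            this.reject = Pattern.compile(rejectRegex);
        }

        @Override
        public String filter(String url) {
            return reject.matcher(url).find() ? null : url;
        }
    }
}
```

Returning a String rather than a boolean leaves room for a filter to
rewrite a URL as well as accept or reject it.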

However, I do not quite follow your discussion about 3xx forwards.

John

On Mon, Jan 31, 2005 at 08:03:03PM -0500, Chirag Chaman wrote:
> John:
> 
> This is a very good idea -- and one that we currently use as a "hack"
> (i.e. very slow)
> 
> Here are a few things that we faced:
> 
> 1. At times we need to reprocess rules. Example:
>       - Run URL filter and remove URL
>       - Run RegexURL filter to transform the passed URL to another URL
>       - Now it may be required to run URL filter again
> 
>       Thus, having a way to reject in RegexURL would be nice. That would
>       also make URLFilter redundant
> 
> 2. 3xx forwards -- they seem to get by as the first URL gets recorded.
> There needs to be a way where getting a 3xx forward should junk the 
> old url and start taking the new one or both (user defined). Now the 
> resulting URL should be checked against filters. Thus we need the
> ability to call the plugin from protocol-http.
> 
> 3. As rules grow, filtering becomes slow -- prior to using Nutch we 
> were using a commercial RETE rules engine into which we had loaded the 
> REs as rules. This improved speed immensely. Maybe overkill for 
> now.  Below is a simpler way to do this.
> 
> Here's what we're planning on building -- is this helpful? How would 
> this play in with plugins...
> 
> <GROUP> Rule Group Name
> <RULE>
>       <MATCH> RE to match </MATCH>
>       <ACTION> Discard/Substitution/GoTo </ACTION>
>       <SUBSTITUTION> Substitution </SUBSTITUTION>
>       <GOTO>RuleGroupToSendProcess</GOTO>
>       <STOP> 0 or 1 - 0 would mean keep processing more rules </STOP>
> </RULE>
> </GROUP>
> 
> Here's how this would work.
> 
> -Each file has a "Default" group, under which all rules are kept.
> -For more advanced rules, one could send control to another RuleGroup 
> on match (helpful when you want specific groups of rules for a certain 
> domain, extension, etc.) -- this will cut down the number of rules to
> look at.
> -The Stop exits upon a match or keeps processing more rules in the 
> same group.
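
The rule-group scheme described above could be sketched roughly as
follows. Everything here is hypothetical: the class names, the choice
that GoTo hands the URL over to the target group without returning, and
the jump limit that guards against GoTo cycles are assumptions, not part
of the proposal. Because Discard and Substitution live in the same rule
list, a rewritten URL can still be rejected by a later rule, which covers
the reprocessing case in point 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rule-group engine; all names here are illustrative. */
final class RuleGroupSketch {
    enum Action { DISCARD, SUBSTITUTION, GOTO }

    static final class Rule {
        final Pattern match;
        final Action action;
        final String argument; // replacement text, or target group for GOTO
        final boolean stop;    // true: stop after this rule matches

        Rule(String regex, Action action, String argument, boolean stop) {
            this.match = Pattern.compile(regex);
            this.action = action;
            this.argument = argument;
            this.stop = stop;
        }
    }

    private static final int MAX_JUMPS = 16; // assumed guard against GoTo cycles
    private final Map<String, List<Rule>> groups = new HashMap<>();

    RuleGroupSketch add(String group, Rule rule) {
        groups.computeIfAbsent(group, g -> new ArrayList<>()).add(rule);
        return this;
    }

    /** Returns the possibly rewritten URL, or null if a rule discards it. */
    String apply(String url) {
        return apply("Default", url, 0);
    }

    private String apply(String group, String url, int jumps) {
        if (jumps > MAX_JUMPS) {
            throw new IllegalStateException("GoTo cycle near group " + group);
        }
        List<Rule> rules = groups.getOrDefault(group, Collections.<Rule>emptyList());
        for (Rule rule : rules) {
            Matcher m = rule.match.matcher(url);
            if (!m.find()) {
                continue;
            }
            if (rule.action == Action.DISCARD) {
                return null;
            }
            if (rule.action == Action.GOTO) {
                // Assumption: control moves to the target group and stays there.
                return apply(rule.argument, url, jumps + 1);
            }
            url = m.replaceAll(rule.argument);
            if (rule.stop) {
                return url;
            }
        }
        return url;
    }

    /** Small example rule set using made-up sites. */
    static RuleGroupSketch example() {
        return new RuleGroupSketch()
            .add("Default", new Rule("^http://www\\.oldjunksite\\.com/", Action.DISCARD, null, true))
            .add("Default", new Rule("\\.example\\.org/", Action.GOTO, "example", true))
            .add("example", new Rule("\\?sessionid=[^&]*", Action.SUBSTITUTION, "", true));
    }
}
```

In this example, only URLs on example.org are ever tested against the
session-id rule, which is the "cut down the number of rules" effect.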
>  
> 
> 
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> John X
> Sent: Monday, January 31, 2005 7:53 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: [Nutch-dev] make URLFilter as plugin
> 
> Hi, All,
> 
> I propose to define plugin extension point for URLFilter, and convert 
> current RegexURLFilter.java, PrefixURLFilter.java, etc., into plugins.
> However there is one requirement, different from other plugin 
> extensions: we should be able to specify the order in which plugins are
> loaded and applied.
> I have not checked, but I assume, by default, we can always name 
> plugins in alphabetical order.
> Stefan: any better way to do this?
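
To illustrate why the order matters, here is a small sketch that applies
filters in an explicit order and compares it with alphabetical order.
The plugin ids, the map-based registry, and the null-means-reject
convention are all assumptions made for the example; this is not how the
plugin system actually loads extensions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical ordered filter chain; ids and registry are made up. */
final class FilterOrderSketch {

    /** Applies filters in the given order and stops at the first rejection. */
    static String applyInOrder(List<String> order,
                               Map<String, UnaryOperator<String>> plugins,
                               String url) {
        for (String id : order) {
            UnaryOperator<String> filter = plugins.get(id);
            if (filter == null) {
                continue; // unknown ids are skipped in this sketch
            }
            url = filter.apply(url);
            if (url == null) {
                return null;
            }
        }
        return url;
    }

    /** Default ordering: plugin ids sorted alphabetically. */
    static List<String> alphabetical(Map<String, UnaryOperator<String>> plugins) {
        List<String> ids = new ArrayList<>(plugins.keySet());
        Collections.sort(ids);
        return ids;
    }

    /** Two made-up plugins: one rewrites URLs, one rejects them. */
    static Map<String, UnaryOperator<String>> examplePlugins() {
        Map<String, UnaryOperator<String>> plugins = new HashMap<>();
        plugins.put("strip-www", url -> url.replaceFirst("^http://www\\.", "http://"));
        plugins.put("reject-www", url -> url.contains("//www.") ? null : url);
        return plugins;
    }
}
```

With alphabetical order, "reject-www" runs before "strip-www" and the URL
is dropped; with the explicit order "strip-www", "reject-www" the same
URL is rewritten first and then accepted.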
> 
> If no one thinks this is a bad idea, I am going to start work on it 
> right away.
> 
> John
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by: IntelliVIEW -- Interactive 
> Reporting Tool for open source databases. Create drag-&-drop reports. 
> Save time by over 75%! Publish reports on the web. Export to DOC, XLS,
> RTF, etc.
> Download a FREE copy at http://www.intelliview.com/go/osdn_nl
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 
> 
> 
> 


