--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> There is - if it's an HTML page, add HTMLFilter. If it's another type
> of content, I'm afraid there is no general post-processing hook to
> add plugins.

I'll check that out! Thanks for pointing me to this.


> > I'd like to also look at Bayesian filtering during the parse phase
> > to look for hidden font (text same color as background) and spammy
> > pages or for sites with 3+ AdSense ads or other particulars and
> > score appropriately.
> >
> > Has anyone experimented with this?
> >
> 
> Again, HTMLFilters is the place to add such things.
> 
> Now, an interesting thing would be to keep this categorization around,
> so that next time you can skip/demote pages which are known as spam.
> This is the purpose of the "CrawlDatum metadata" patch... coming soon,
> I hope :-)

That's what I'm waiting for (and rather excited about) :)

I'm looking to flag adult-related pages initially, but I also want to
use the existing filtering process to look for patterns to flag as
spam.
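
To make the idea concrete, here is a rough sketch of the kind of checks
I have in mind, written as standalone Python rather than as a Nutch
plugin. Everything in it is an assumption for illustration only: the
regexes, the names spam_signals and looks_spammy, the use of the
"google_ad_client" string as a marker for an AdSense block, and the
threshold of 3 ads. It only catches the simplest case of hidden text
(a <font color> equal to the <body bgcolor>) and ignores CSS entirely.

```python
import re

# Hypothetical threshold, borrowed from the "3+ AdSense ads" idea above.
MAX_ADS = 3

# Rough patterns for legacy HTML colour attributes; CSS is not handled.
FONT_COLOR = re.compile(r'<font[^>]*\bcolor\s*=\s*["\']?(#?\w+)', re.I)
BODY_BG = re.compile(r'<body[^>]*\bbgcolor\s*=\s*["\']?(#?\w+)', re.I)

# Assumption: each AdSense block sets a google_ad_client variable.
ADSENSE = re.compile(r'google_ad_client', re.I)


def spam_signals(html):
    """Count simple spam indicators in a raw HTML string."""
    bg = BODY_BG.search(html)
    background = bg.group(1).lower() if bg else None

    hidden = 0
    if background is not None:
        # A font colour identical to the page background is treated
        # as one run of hidden text.
        hidden = sum(
            1 for colour in FONT_COLOR.findall(html)
            if colour.lower() == background
        )

    ads = len(ADSENSE.findall(html))
    return {"hidden_text_runs": hidden, "ad_blocks": ads}


def looks_spammy(html):
    """Flag a page with any hidden text or MAX_ADS or more ad blocks."""
    signals = spam_signals(html)
    return signals["hidden_text_runs"] > 0 or signals["ad_blocks"] >= MAX_ADS
```

In a real setup the counts would feed a score (or a Bayesian
classifier as one feature among many) rather than a hard yes/no, and
the logic would live wherever the HTML filter hook runs.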

-byron


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
