--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> There is - if it's an HTML page, add HTMLFilter. If it's other type of
> content, I'm afraid there is no general post-processing hook to add plugins.

I'll check that out! Thanks for pointing me to this.

> > I'd like to also look at Bayesian filtering during the
> > parse phase to look for hidden font (text same color
> > as background) and spammy pages or for sites with 3+
> > AdSense ads or other particulars and score
> > appropriately.
> >
> > Has anyone experimented with this?
>
> Again, HTMLFilters is the place to add such things.
>
> Now, an interesting thing would be to keep this categorization around,
> so that next time you can skip/demote pages, which are known as spam.
> This is the purpose of the "CrawlDatum metadata" patch... coming soon, I
> hope :-)

That's what I'm waiting (rather excited) for :) I'm looking to initially
flag adult-related pages, but also to use the existing filtering process
to look for patterns to flag as spam.

-byron

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
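For illustration, the two page-level signals mentioned in this thread (text in the same color as the background, and three or more AdSense units) could be sketched roughly as below. This is a standalone heuristic over raw HTML, not Nutch's HTMLFilter API; the function names, the `google_ad_client` marker used to count ad units, the default white background, and the threshold constant are all assumptions made for the example.

```python
import re

# Hypothetical threshold, taken from the "3+ AdSense ads" remark in the thread.
MAX_AD_UNITS = 3

# Assumption: each AdSense unit shows up as one "google_ad_client" marker.
AD_MARKER = re.compile(r"google_ad_client", re.IGNORECASE)
# Only legacy attribute-style colors are inspected; CSS is ignored here.
BODY_BGCOLOR = re.compile(r"<body[^>]*\bbgcolor\s*=\s*[\"']?(#?\w+)", re.IGNORECASE)
FONT_COLOR = re.compile(r"<font[^>]*\bcolor\s*=\s*[\"']?(#?\w+)", re.IGNORECASE)


def count_ad_units(html):
    """Count occurrences of the assumed ad-unit marker."""
    return len(AD_MARKER.findall(html))


def has_hidden_font(html):
    """True if any <font color=...> literally equals the body background."""
    match = BODY_BGCOLOR.search(html)
    # Assumption: pages without an explicit bgcolor render on white.
    background = match.group(1).lower() if match else "#ffffff"
    return any(color.lower() == background for color in FONT_COLOR.findall(html))


def spam_flags(html):
    """Return the list of heuristic flags raised for one page."""
    flags = []
    if has_hidden_font(html):
        flags.append("hidden-font")
    if count_ad_units(html) >= MAX_AD_UNITS:
        flags.append("many-ads")
    return flags
```

Colors are compared as literal strings, so `white` and `#ffffff` would not be treated as equal. Flags like these could serve as features for a Bayesian classifier or be stored as per-page metadata once that is available.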
