> Does any of you have a "porn filter" that avoids fetching those URLs and
> therefore saves bandwidth and storage space?
>
> I could do that with regular expressions and the URL filter, of course. But
> maybe there is another way and somebody already made a plugin for that. Any
> hints would be great.

We have an "adult content" filter that we'll be contributing back to Nutch. It uses keywords from the URL, content and meta fields to generate a probability value. We then flag pages as probable or possible adult, based on ranges of that value. It seems to be working pretty well for us, though now we need to replicate and re-tune it for poker and drug spam.

Note that this does mean that an adult page still gets fetched. Where it's a win is in penalizing (via OPIC-style scoring) the pages that this adult page points to, so we still wind up fetching a lot fewer worthless pages.

One potential problem this creates is that a lot of adult sites contain links to download various video player software. So some high-level pages at Adobe, Microsoft, Apple, etc. wind up getting identified as also being "adult" in nature, but since those pages aren't part of our focused crawl anyway, it's not a big deal for us.

-- Ken

-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
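
To make the approach described above concrete, here is a minimal sketch in Python. It is not the filter mentioned in the message and not Nutch code. The keyword list, field weights, thresholds, penalty factors, the noisy-OR combination and all function names are assumptions chosen for illustration. It scores a page from its URL, meta and content fields, maps the probability onto "probable" / "possible" / "clean" ranges, and then scales down the score share handed to outlinks, in the spirit of OPIC-style scoring.

```python
import re

# Hypothetical keyword weights; a real list would be tuned on labeled pages.
KEYWORD_WEIGHTS = {"xxx": 0.9, "porn": 0.9, "sex": 0.5, "adult": 0.4}

# Hypothetical per-field trust: a keyword in the URL counts more than in body text.
FIELD_WEIGHTS = {"url": 1.0, "meta": 0.8, "content": 0.5}

# Hypothetical probability ranges for the two flags.
PROBABLE_MIN = 0.8
POSSIBLE_MIN = 0.4

# Hypothetical multipliers applied to the score passed on to outlinks.
OUTLINK_PENALTY = {"probable": 0.0, "possible": 0.5, "clean": 1.0}


def adult_probability(fields):
    """Combine keyword hits across fields with a noisy-OR (an assumed model)."""
    p_clean = 1.0
    for name, text in fields.items():
        field_weight = FIELD_WEIGHTS.get(name, 0.0)
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        for keyword, weight in KEYWORD_WEIGHTS.items():
            if keyword in tokens:
                p_clean *= 1.0 - field_weight * weight
    return 1.0 - p_clean


def classify(probability):
    """Map a probability onto the 'probable' / 'possible' / 'clean' ranges."""
    if probability >= PROBABLE_MIN:
        return "probable"
    if probability >= POSSIBLE_MIN:
        return "possible"
    return "clean"


def distribute_score(page_score, outlinks, label):
    """Split the page's score evenly over its outlinks, scaled by the penalty."""
    if not outlinks:
        return {}
    share = page_score * OUTLINK_PENALTY[label] / len(outlinks)
    return {url: share for url in outlinks}


if __name__ == "__main__":
    page = {
        "url": "http://example.com/xxx-videos",
        "meta": "adult porn",
        "content": "free porn videos",
    }
    label = classify(adult_probability(page))
    print(label, distribute_score(1.0, ["http://example.org/player"], label))
```

In this sketch the flagged page itself has already been fetched; the saving comes from its outlinks receiving little or no score. It also reproduces the side effect noted above: a legitimate download page linked only from flagged pages ends up with a low score as well.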
