>does anyone of you have a "pornfilter" not to fetch those URLs and therefore
>save bandwidth and storage space?
>
>I could do that with regular expressions and the URL-filter, of course. But
>maybe there is another way and somebody already made a plugin for that. Any
>hints would be great.

We have an "adult content" filter that we'll be contributing back to 
Nutch. It uses keywords from the URL, content and meta fields to 
generate a probability value. We then flag pages as probable or 
possible adult, based on ranges. Seems to be working pretty well for 
us, though now we need to replicate & re-tune for poker and drug spam.
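
For anyone who wants to try the general idea before the plugin is contributed, here is a rough sketch of keyword-based scoring with range-based flags. This is not our actual filter: the keyword list, the per-field weights, the bias, the logistic squashing, and the two thresholds are all made-up assumptions for illustration, and you would need to tune them against your own crawl.

```python
import math
import re

# All values below are hypothetical; the real keyword list and weights
# are not given in this thread.
KEYWORD_WEIGHTS = {"xxx": 2.0, "porn": 2.5, "adult": 1.0}
FIELD_WEIGHTS = {"url": 2.0, "meta": 1.5, "content": 1.0}
BIAS = -3.0                # keeps pages with no keyword hits near zero
PROBABLE_THRESHOLD = 0.8   # assumed lower bound of the "probable" range
POSSIBLE_THRESHOLD = 0.5   # assumed lower bound of the "possible" range


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def adult_probability(url, content, meta):
    """Combine keyword hits from the URL, content and meta fields into a 0..1 value."""
    score = BIAS
    for field, text in (("url", url), ("content", content), ("meta", meta)):
        tokens = set(tokenize(text))
        for keyword, weight in KEYWORD_WEIGHTS.items():
            if keyword in tokens:
                score += FIELD_WEIGHTS[field] * weight
    # Logistic function maps the raw score to a probability-like value.
    return 1.0 / (1.0 + math.exp(-score))


def classify(probability):
    """Map the probability onto the probable / possible / clean ranges."""
    if probability >= PROBABLE_THRESHOLD:
        return "probable"
    if probability >= POSSIBLE_THRESHOLD:
        return "possible"
    return "clean"
```

Replicating this for poker or drug spam would, under the same assumptions, mostly mean swapping in a different keyword table and re-tuning the thresholds.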

Note that this means an adult page still gets fetched. The win is in 
penalizing (via OPIC-style scoring) the pages that this adult page 
points to, so we still wind up fetching a lot fewer worthless pages.
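
A minimal sketch of that penalty, assuming an OPIC-like scheme where each fetched page splits its score evenly among its outlinks. The penalty multipliers and function names here are hypothetical, and this is plain Python rather than Nutch's scoring plugin API.

```python
# Hypothetical multipliers applied to the score a page passes to its outlinks.
PENALTY = {"probable": 0.0, "possible": 0.25, "clean": 1.0}


def distribute_score(page_score, outlinks, flag):
    """Split a page's score evenly among its outlinks, scaled by its adult flag."""
    if not outlinks:
        return {}
    share = page_score * PENALTY[flag] / len(outlinks)
    return {url: share for url in outlinks}


def accumulate(scores, contributions):
    """Add the contributions from one fetched page to the running outlink scores."""
    for url, share in contributions.items():
        scores[url] = scores.get(url, 0.0) + share
    return scores


def fetch_order(scores):
    """Highest-scoring URLs first, so penalized targets sink to the bottom."""
    return sorted(scores, key=lambda url: scores[url], reverse=True)
```

With these assumed multipliers, a URL linked only from "probable" pages ends up with a score of zero and is the last candidate for fetching, while the same URL linked from a clean page still receives its normal share.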

One potential problem this creates is that a lot of adult sites 
contain links to download various video player software. So some 
high-level pages at Adobe, Microsoft, Apple, etc. wind up getting 
identified as also being "adult" in nature, but since those pages 
aren't part of our focused crawl anyway, it's not a big deal for us.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
