Does anyone have a "porn filter" that avoids fetching those URLs and therefore
saves bandwidth and storage space?

I could do that with regular expressions and the URL-filter, of course. But
maybe there is another way and somebody already made a plugin for that. Any
hints would be great.
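
For reference, the regular-expression approach mentioned above might look roughly like the sketch below. It imitates the "first matching rule wins" behaviour of a +/- style URL filter, but the rule list, the keywords and the `should_fetch` name are all made up for illustration; this is not Nutch's own code.

```python
import re

# Hypothetical rules: "-" rejects, "+" accepts, and the first matching rule wins.
# The keyword pattern is illustrative only and would need real tuning.
RULES = [
    ("-", re.compile(r"(?i)(porn|xxx)")),
    ("+", re.compile(r".")),  # accept everything else
]

def should_fetch(url):
    """Return True if the URL passes the filter and may be fetched."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # No rule matched (only possible for an empty URL here): reject.
    return False
```

A URL-only filter like this never downloads the rejected page, but it only sees the URL, so it misses adult pages with innocuous addresses.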

We have an "adult content" filter that we'll be contributing back to Nutch. It uses keywords from the URL, content and meta fields to generate a probability value. We then flag pages as probable or possible adult, based on ranges. Seems to be working pretty well for us, though now we need to replicate & re-tune for poker and drug spam.
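
To make the idea concrete, here is a rough sketch of keyword-based scoring over the URL, content and meta fields, followed by range-based flagging. It is not our actual filter: the keyword weights, field weights, bias and thresholds are invented for the example, and the logistic squashing is just one plausible way to turn a keyword score into a probability.

```python
import math
import re

# Hypothetical keyword weights; a real filter would tune these on labeled pages.
KEYWORD_WEIGHTS = {"porn": 2.5, "xxx": 2.0, "adult": 1.0}
# Hypothetical field multipliers: a hit in the URL counts more than one in the body.
FIELD_WEIGHTS = {"url": 2.0, "meta": 1.5, "content": 1.0}
BIAS = -4.0                # assumed prior: most pages are not adult
PROBABLE_THRESHOLD = 0.8   # assumed range boundaries
POSSIBLE_THRESHOLD = 0.5

def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def adult_probability(url, content, meta):
    """Combine keyword hits from all three fields into a value in (0, 1)."""
    score = BIAS
    for field, text in (("url", url), ("content", content), ("meta", meta)):
        # Each distinct keyword counts once per field.
        for tok in set(tokens(text)):
            score += FIELD_WEIGHTS[field] * KEYWORD_WEIGHTS.get(tok, 0.0)
    return 1.0 / (1.0 + math.exp(-score))

def classify(probability):
    """Map the probability onto the 'probable' / 'possible' / 'clean' ranges."""
    if probability >= PROBABLE_THRESHOLD:
        return "probable"
    if probability >= POSSIBLE_THRESHOLD:
        return "possible"
    return "clean"
```

With these made-up numbers, a page at `http://example.com/adult-education` whose content and meta text also mention "adult" lands in the "possible" range, which is the kind of borderline case the two-level flagging is meant to absorb.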

Note that this means the adult page itself still gets fetched. The win is in penalizing (via OPIC-style scoring) the pages that this adult page points to, so we still wind up fetching a lot fewer worthless pages.
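
A sketch of what that penalty could look like, assuming the usual OPIC-style rule that a page splits its score evenly among its outlinks. The `penalty` factor and the `distribute_score` name are assumptions for the example, not what our code or Nutch's scoring plugin actually does.

```python
def distribute_score(page_score, outlinks, adult_prob, penalty=0.9):
    """Split a page's score across its outlinks, scaled down for adult pages.

    adult_prob is the page's adult probability in [0, 1]; penalty is a
    hypothetical knob controlling how strongly that probability cuts the
    score passed on to linked pages.
    """
    unique_links = list(dict.fromkeys(outlinks))  # drop duplicates, keep order
    if not unique_links:
        return {}
    damping = 1.0 - penalty * adult_prob
    share = page_score * damping / len(unique_links)
    return {link: share for link in unique_links}
```

Pages reached mainly from adult pages then accumulate little score and sink to the bottom of the fetch list, so they are unlikely to be fetched within the crawl budget.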

One potential problem this creates is that a lot of adult sites contain links to download various video player software. So some high-level pages at Adobe, Microsoft, Apple, etc. wind up getting identified as also being "adult" in nature, but since those pages aren't part of our focused crawl anyway, it's not a big deal for us.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
