Does anyone have a "porn filter" that avoids fetching those URLs and
therefore saves bandwidth and storage space?
I could do that with regular expressions and the URL filter, of course. But
maybe there is another way, and somebody has already made a plugin for that.
Any hints would be great.
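
For reference, the regular-expression approach mentioned above could look
roughly like the sketch below. It is only an illustration: the keyword list
and the should_fetch() helper are made up for this example and are not part
of Nutch or of any existing plugin.

```python
import re

# Hypothetical keyword list, for illustration only. A real deny-list would be
# much larger and would need tuning against false positives.
BLOCKED_KEYWORDS = ["porn", "xxx", "adult"]

# Match a keyword only when it is not surrounded by other letters, so that
# "adult" does not match inside "adulthood". This is still crude: a URL such
# as ".../adult-education" would be rejected as well.
BLOCKED_RE = re.compile(
    r"(?<![a-z])(?:" + "|".join(map(re.escape, BLOCKED_KEYWORDS)) + r")(?![a-z])",
    re.IGNORECASE,
)


def should_fetch(url: str) -> bool:
    """Return False if the URL contains a blocked keyword, True otherwise."""
    return BLOCKED_RE.search(url) is None


if __name__ == "__main__":
    for url in ("http://example.com/docs/index.html",
                "http://xxx.example.com/",
                "http://example.com/adulthood"):
        print(url, should_fetch(url))
```

A filter like this only sees the URL, so it saves the fetch entirely but
misses any page whose URL looks innocent.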
We have an "adult content" filter that we'll be contributing back to
Nutch. It uses keywords from the URL, content and meta fields to
generate a probability value. We then flag pages as probable or
possible adult, based on ranges. Seems to be working pretty well for
us, though now we need to replicate & re-tune for poker and drug spam.
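
To make the idea concrete, here is a minimal sketch of keyword-based
scoring with range-based flags. It is not the actual filter: the keywords,
the per-field weights, the noisy-OR combination and the two thresholds are
all assumptions picked for illustration.

```python
import re

# All values below are assumptions for illustration, not the real filter's.
KEYWORDS = {"porn", "xxx", "nude"}

# Assumed probability that a single keyword hit in a given field indicates
# an adult page. URL and meta hits are treated as stronger evidence than
# hits in the body text.
FIELD_WEIGHTS = {"url": 0.6, "meta": 0.4, "content": 0.15}

# Assumed ranges: [0.8, 1.0] is "probable", [0.4, 0.8) is "possible".
PROBABLE_MIN = 0.8
POSSIBLE_MIN = 0.4


def count_hits(text: str) -> int:
    """Count how many alphabetic tokens in the text are listed keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in KEYWORDS)


def adult_probability(url: str, meta: str, content: str) -> float:
    """Combine keyword hits from all three fields with a noisy-OR."""
    p_clean = 1.0
    for field, text in (("url", url), ("meta", meta), ("content", content)):
        p_clean *= (1.0 - FIELD_WEIGHTS[field]) ** count_hits(text)
    return 1.0 - p_clean


def classify(probability: float) -> str:
    """Map a probability onto the 'probable' / 'possible' / 'clean' ranges."""
    if probability >= PROBABLE_MIN:
        return "probable"
    if probability >= POSSIBLE_MIN:
        return "possible"
    return "clean"


if __name__ == "__main__":
    p = adult_probability("http://xxx.example.com/porn", "", "")
    print(round(p, 2), classify(p))
```

Re-tuning for another spam category (poker, drugs) would, in a sketch like
this, mean swapping the keyword set and adjusting the weights and ranges.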
Note that this means an adult page still gets fetched; the win is in
penalizing (via OPIC-style scoring) the pages that this adult page
points to. So we still wind up fetching a lot fewer worthless pages.
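
A rough sketch of how such a penalty could work, assuming the usual OPIC
behaviour of splitting a page's score evenly among its outlinks. The
penalty factors and function names are hypothetical, not Nutch's scoring
API.

```python
# Assumed multipliers applied to the score a page passes on to its outlinks.
# "probable" adult pages pass on nothing, "possible" ones pass on a quarter.
PENALTY = {"clean": 1.0, "possible": 0.25, "probable": 0.0}


def distribute_score(page_score, outlinks, adult_label):
    """Split the (penalized) page score evenly among the page's outlinks."""
    if not outlinks:
        return {}
    share = page_score * PENALTY[adult_label] / len(outlinks)
    return {url: share for url in outlinks}


def update_scores(scores, contributions):
    """Add each outlink's share to its accumulated score in the crawl state."""
    for url, share in contributions.items():
        scores[url] = scores.get(url, 0.0) + share
    return scores


if __name__ == "__main__":
    scores = {}
    update_scores(scores, distribute_score(1.0, ["http://a/", "http://b/"], "clean"))
    update_scores(scores, distribute_score(1.0, ["http://b/", "http://c/"], "probable"))
    # Pages reachable only from the flagged page end up with a score of 0.0
    # and sort to the bottom of the fetch list.
    print(sorted(scores.items(), key=lambda item: -item[1]))
```

With a fetch list built from the top-scoring URLs, pages linked only from
flagged pages are unlikely to ever be selected.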
One potential problem is that a lot of adult sites link to downloads
for various video player software. So some high-level pages at Adobe,
Microsoft, Apple, etc. wind up also being identified as "adult" in
nature, but since those pages aren't part of our focused crawl anyway,
it's not a big deal for us.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"