Just like Shane, I have also considered developing a filtered search engine - one that is child safe.
Please let me know if this is possible:
I am sure it is possible. It is definitely something I think some of the larger engines should be championing.
1) add all sites appearing in the Open Directory adult categories to a "do not index list"
They make the data freely available so all those sites listed could be removed quite easily.
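As a rough sketch of the "do not index list" idea: assume the adult-category URLs have already been pulled out of the directory data into a plain list, one URL per line (that extraction step is not shown, and the file layout is my assumption). The function names and the example hosts below are made up for illustration and are not part of Nutch.

```python
from urllib.parse import urlsplit


def load_blocked_hosts(lines):
    """Build a set of hostnames from an iterable of URLs, one per line."""
    hosts = set()
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        host = urlsplit(url).hostname  # None when the line has no scheme/host
        if host:
            hosts.add(host.lower())
    return hosts


def is_blocked(url, blocked_hosts):
    """Return True if the URL's host, or any parent domain of it, is listed."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    parts = host.lower().split(".")
    # Check "www.adult.example", then "adult.example", then "example".
    return any(".".join(parts[i:]) in blocked_hosts for i in range(len(parts)))


# Hypothetical entries standing in for the exported category listings.
blocked = load_blocked_hosts([
    "http://adult.example/",
    "",
    "http://bad.example/listing.html\n",
])
```

A crawler would call something like is_blocked() before fetching or indexing a page. Matching on parent domains is a design choice: it also catches subdomains of a listed site, at the cost of occasionally blocking an unrelated subdomain.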
2) use filter/stop words to remove most profanity from the index
Easily done.
(I think there is a workaround: people can put quotes around words to search past filter words in Nutch.)
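A minimal sketch of how a filter-word check might behave, assuming a simple lowercase alphanumeric tokenizer (this is not how Nutch tokenizes, and the one-word filter list is only a placeholder). It compares whole tokens against the list and contrasts that with naive substring matching, which would also hit a surname such as "Sexton".

```python
import re

# Placeholder list; a real filter list would be much longer.
FILTER_WORDS = {"sex"}


def tokens(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hits_by_token(text, filter_words=FILTER_WORDS):
    """Return the tokens that are exactly equal to a filter word."""
    return [t for t in tokens(text) if t in filter_words]


def hits_by_substring(text, filter_words=FILTER_WORDS):
    """Return the filter words found anywhere in the text, even inside other words."""
    lowered = text.lower()
    return sorted(w for w in filter_words if w in lowered)


sample = "Contact John Sexton in Essex"
token_hits = hits_by_token(sample)          # whole-token comparison
substring_hits = hits_by_substring(sample)  # substring comparison
```

With whole-token matching the sample produces no hits, while substring matching flags it because "sex" occurs inside "Sexton" and "Essex". That suggests doing the comparison after tokenizing rather than on the raw text.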
One final question: Is stemming available in Nutch?
Not sure, but if not, it can easily be done using Perl. There are Java implementations of the Porter stemming algorithm, which I have used with some success, so it shouldn't be a problem to implement. Doing it in Nutch is another thing altogether; some of the smarter guys on here should be able to answer that one.
There are instances where this can be a good thing or a problem. An example is the common last name "Sexton": if "sex" were a filter word, would that name be filtered out of the index?
"Sexton" stems to "Sexton", so there is no problem there. There are a few applets online that allow you to stem words; search for "porter stemming java" and you should get a few. You would be surprised at what some words actually stem down to. I am not sure if stemming would be any use in rating a page. You could build a synonym list for words that are unfavourable, but there are better methods for ranking pages.
Just curious. I would rather develop an algorithm for scoring the content of a webpage. I know that not all use of the word "sex" is pornographic.
Have a look at Bayesian filtering for this sort of thing. It is currently being used with some success against spam.
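To give a feel for the Bayesian approach, here is a toy naive Bayes text classifier with add-one (Laplace) smoothing. It is only a sketch: the class name, the labels "adult" and "ok", and the five training sentences are invented for illustration, and a usable filter would need a large, hand-labelled set of real pages.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, label, text):
        self.doc_counts[label] += 1
        counter = self.word_counts.setdefault(label, Counter())
        for t in tokens(text):
            counter[t] += 1
            self.vocab.add(t)

    def log_scores(self, text):
        """Return label -> log(prior) + sum of log word likelihoods."""
        total_docs = sum(self.doc_counts.values())
        words = tokens(text)
        scores = {}
        for label, counter in self.word_counts.items():
            total_words = sum(counter.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for t in words:
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((counter[t] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented training examples; real training data would be labelled pages.
clf = NaiveBayes()
clf.train("adult", "explicit adult pictures sex videos")
clf.train("adult", "adult sex chat explicit")
clf.train("ok", "sex education guide for parents and teachers")
clf.train("ok", "school biology lesson on plants and animals")
clf.train("ok", "parish sexton and church history")

label = clf.classify("sex education lesson for teachers")
```

The point of the example is that the word "sex" alone does not decide the outcome: the surrounding words shift the score, so a page about sex education can still come out as "ok" even though the same word appears in the "adult" training text.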
Harry
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
