> > 2) use filter/stop words to remove most profanity from the index

This one can be tricky. As you know, if you block every site that contains a profane word, you will also remove many safe sites with useful information. Objectionable sites can also be in any language, which complicates things further: some ordinary English words have profane meanings in other languages.

At a previous job we had a heuristic-based algorithm that rated sites by their "cleanliness". It helped, but it was far from perfect. For example, some sites were tagged as perfectly clean because they contained sexual pictures but no sexual words. In a way, the problem is similar to spam detection: catching 90% of the sites is easy, but the other 10% can take an enormous amount of work.

> a problem. An example is the common last name
> "Sexton", if sex was a filter word, would that name
> be filtered out of the index?

A good implementation of a stemming algorithm does not have this problem, since it uses dictionaries and thesauri to determine word variations rather than plain substring matching. For example, "dietary" is stemmed to "diet" but not to "die".

Diego.

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
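
P.S. To make the "Sexton" point concrete, here is a minimal sketch in Python. It is not Nutch code: the filter list, the tiny stem dictionary, and all function names are made up for illustration. It only shows why matching whole tokens (optionally after a dictionary-based stem lookup) behaves differently from naive substring matching.

```python
import re

# Hypothetical filter list and stem dictionary, for illustration only.
FILTER_WORDS = {"sex"}
STEM_DICT = {"dietary": "diet", "diets": "diet"}

TOKEN_RE = re.compile(r"[a-z]+")


def tokens(text):
    """Split text into lowercase alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def stem(token):
    """Dictionary lookup: only known variations are mapped; others pass through."""
    return STEM_DICT.get(token, token)


def substring_hit(text):
    """Naive check: flags the text if a filter word appears anywhere inside it."""
    lowered = text.lower()
    return any(word in lowered for word in FILTER_WORDS)


def token_hit(text):
    """Token check: flags the text only if a whole (stemmed) token is a filter word."""
    return any(stem(tok) in FILTER_WORDS for tok in tokens(text))
```

With these definitions, substring_hit("John Sexton") flags the name, while token_hit("John Sexton") does not, because "sexton" is a different token from "sex". Likewise stem("dietary") gives "diet" and never "die", since the lookup maps whole words instead of chopping characters off the end.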
