> 2) use filter/stop words to remove most profanity from
> the index

This one can be tricky. As you know, if you block all
sites containing profane words, you will also remove
lots of safe sites with useful information. The
fact that objectionable sites can be in any language
complicates things further: there are English words
that have profane meanings in other languages. At a
previous job we had a heuristic-based algorithm to
rate sites with respect to their "cleanliness". It
helped, but it was far from perfect. For example, some
sites were tagged as perfectly clean because they
contained sexual pictures but no sexual words.
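
As a rough illustration of a word-based rating (this is not the
algorithm from that job; the word list, weights, threshold, and
function names are all invented for the example), a cleanliness
score could be based on the weighted share of flagged tokens on a
page:

```python
import re

# Hypothetical weights, invented for illustration only:
# a higher weight means a more objectionable word.
FLAGGED_WEIGHTS = {"sex": 1.0, "porn": 3.0, "xxx": 3.0}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def cleanliness_score(text):
    """Return a score in [0, 1]; 1.0 means no flagged tokens were found."""
    tokens = tokenize(text)
    if not tokens:
        return 1.0
    penalty = sum(FLAGGED_WEIGHTS.get(t, 0.0) for t in tokens)
    # Normalize by page length so that a single flagged word
    # on a long page costs little.
    return max(0.0, 1.0 - penalty / len(tokens))


def is_clean(text, threshold=0.9):
    """Assumed cutoff: pages scoring at or above the threshold pass."""
    return cleanliness_score(text) >= threshold
```

Note that a page made up only of images and neutral captions gets a
perfect score here, which is exactly the blind spot described above.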

In a way, the problem is similar to spam detection:
catching 90% of the sites is easy, but the other 10% can
take an enormous amount of work.

> a problem.  An example is the common last name
> "Sexton",  if sex was a filter word, would that name
> be filtered out of the index?

A good implementation of a stemming algorithm does not
have this problem, since it uses dictionaries and
thesauri to determine word variations rather than just
stripping prefixes or suffixes. For example,
"dietary" is stemmed to "diet" but not to "die", and
likewise "Sexton" is not reduced to "sex".
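
A small sketch of the difference, assuming a toy lookup table in
place of a real dictionary-based stemmer (the table, the filter list,
and the function names are made up for the example):

```python
import re

# Hypothetical filter list and a toy stem dictionary. A real stemmer
# would rely on a full lexicon rather than this tiny table.
FILTER_WORDS = {"sex"}
STEM_DICTIONARY = {"dietary": "diet", "diets": "diet", "sexes": "sex"}


def stem(token):
    """Look the token up in the dictionary; unknown words stay unchanged."""
    return STEM_DICTIONARY.get(token, token)


def naive_substring_hit(text):
    """Flag the text if a filter word appears anywhere, even inside a word."""
    lowered = text.lower()
    return any(word in lowered for word in FILTER_WORDS)


def token_hit(text):
    """Flag the text only if the stem of a whole token is a filter word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(stem(t) in FILTER_WORDS for t in tokens)
```

With this sketch, naive_substring_hit("John Sexton") flags the name,
while token_hit("John Sexton") does not, because "sexton" is compared
as a whole token and has no dictionary entry mapping it to "sex".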

Diego.

Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers