Hi,

> I think we've seen this discussion on the list before
> (so Christopher, check the archives!)

Quite :-)

> > The problems that others have experienced in the past are:
> > - what happens with "mis"spellings, e.g. "fsck"?
> > - what happens with dodgy formatting, e.g. "f s c k"?
> > - what happens with words like "Scunthorpe"?
>
> Problem 1: add likely/popular mis-spellings to the list of
> vulger/vulgar language

So when I'm giving a Linux user advice on how to recover from a disk crash,
my "run fsck" comment will get trapped.... the problem here is that context
is *everything*. You just can't know, by seeing the word "fsck" without any
of the surrounding text, whether I'm swearing at another geek or helping
them out :-)

There will also be problems with slang and idiom - e.g. "fag" in .uk is a
cigarette, but it's something quite different on the other side of the pond.
Again, this can only be judged from the context.

Finally, the more words you have in your list (to cover common
misspellings), the more likely you are to get a false positive (again,
context) - and you *will* cause offense if you trap someone's name, for
example.

> Problem 2: (contrived) very few single-letter words exist so remove
> intervening white space prior to analysis

Yup, also line breaks, dashes, asterisks, plus signs, etc etc :-)

> Problem 3: Scunthorpe contains an unfortunate series of letters (amongst the
> town's many disadvantages) however the critical four are not a word in and
> of their own right so employ whitespace (\s) in the RegEx or token analysis.

That's a good solution, but it's something that obviously is being missed by
many developers of this sort of algorithm... see the couple of followups I
made immediately after my original response.

> > May I suggest, rather than picking your way through this minefield, you
> > provide a "report abusive comment" link instead?
>
> However some countries are now legislating responsibility that
> ISPs/employers must discharge

Whoops, forgot about that...

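To make the tension between Problems 2 and 3 concrete, here is a rough sketch (in Python rather than PHP, purely for illustration). Everything in it is an assumption on my part: the word list is a placeholder, with "heck" standing in for a real blocked term and "fsck" for a listed misspelling, and the function names are made up. It is not anyone's actual filter.

```python
import re

# Hypothetical word list: "heck" stands in for a real blocked term,
# "fsck" for a popular misspelling that has been added to the list.
BLOCKED = {"fsck", "heck"}

# Problem 3 approach: \b anchors each term so only whole words match.
WHOLE_WORD = re.compile(
    r"\b(?:" + "|".join(sorted(map(re.escape, BLOCKED))) + r")\b",
    re.IGNORECASE,
)

# Problem 2 approach: characters people insert to dodge a filter
# (whitespace incl. line breaks, dashes, asterisks, plus signs, dots, underscores).
SEPARATORS = re.compile(r"[\s\-*+._]+")


def whole_word_hits(text):
    """Return blocked terms that appear as complete words."""
    return [m.group(0).lower() for m in WHOLE_WORD.finditer(text)]


def collapsed_hits(text):
    """Strip separators first, then look for blocked terms anywhere."""
    squashed = SEPARATORS.sub("", text).lower()
    return sorted(term for term in BLOCKED if term in squashed)


if __name__ == "__main__":
    samples = ["run fsck on the disk", "I love checkers",
               "h-e-c-k", "teach Eckhart"]
    for sample in samples:
        print(repr(sample), whole_word_hits(sample), collapsed_hits(sample))
```

The whole-word check leaves "checkers" alone but misses "h-e-c-k"; collapsing separators catches "h-e-c-k" but throws away the word boundaries, so "checkers" and even "teach Eckhart" get flagged. Neither version can tell that "run fsck on the disk" is helpful advice, which is the context problem again.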
> In this case perhaps one could analyse the incoming text and place an
> embargo on its publication on the web site until it has been reviewed by a
> human editor?

Looks like the best solution possible.

If the OP is interested I will see if I can get our content filter word list
from the network manager here... no promises though.

Cheers
Jon

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
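
For what it's worth, the embargo-until-reviewed idea quoted above might look roughly like this. Again this is only a sketch in Python under my own assumptions: the class names, the statuses and the placeholder word "heck" are all hypothetical, and a real site would keep the queue in a database rather than in memory.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PUBLISHED = "published"
    HELD = "held"          # embargoed until a human editor looks at it
    REJECTED = "rejected"


@dataclass(eq=False)       # compare comments by identity, not by text
class Comment:
    text: str
    status: Status = Status.HELD


@dataclass
class ModerationQueue:
    blocked: frozenset     # hypothetical word list
    pending: list = field(default_factory=list)

    def submit(self, text):
        """Publish clean comments at once; hold suspicious ones for review."""
        words = set(re.findall(r"[a-z']+", text.lower()))
        comment = Comment(text)
        if words & self.blocked:
            self.pending.append(comment)
        else:
            comment.status = Status.PUBLISHED
        return comment

    def review(self, comment, approve):
        """Record the human editor's decision on a held comment."""
        self.pending.remove(comment)
        comment.status = Status.PUBLISHED if approve else Status.REJECTED


if __name__ == "__main__":
    queue = ModerationQueue(blocked=frozenset({"heck"}))
    held = queue.submit("What the heck is this?")
    print(held.status)                  # held until reviewed
    queue.review(held, approve=True)
    print(held.status)
```

The point of the design is that the filter never rejects anything on its own: a match only delays publication, so a false positive costs a reviewer a few seconds instead of offending the poster.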