Hi,

> I think we've seen this discussion on the list before
> (so Christopher, check the archives!)

Quite :-)

> > The problems that others have experienced in the past are:
> > - what happens with "mis"spellings, e.g. "fsck"?
> > - what happens with dodgy formatting, e.g "f s c k"?
> > - what happens with words like "Scunthorpe"?
> 
> Problem 1: add likely/popular mis-spellings to the list of 
> vulger/vulgar language

So when I'm giving a Linux user advice on how to recover from a disk crash,
my "run fsck" comment will get trapped... the problem here is that context
is *everything*. You just can't know, by seeing the word "fsck" without any
of the surrounding text, whether I'm swearing at another geek or helping
them out :-)

There will also be problems with slang and idiom - e.g. "fag" in .uk is a
cigarette, but it's something quite different on the other side of the pond.
Again, this can only be judged from the context.

Finally, the more words you have in your list (to cover common
misspellings), the more likely you are to get a false positive (again,
context) - and you *will* cause offense if you trap someone's name, for
example.

> Problem 2: (contrived) very few single-letter words exist so remove
> intervening white space prior to analysis

Yup, also line breaks, dashes, asterisks, plus signs, etc etc :-)
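
For what it's worth, a rough sketch of that stripping step in Python (PCRE
accepts the same character class, so it should carry over to PHP's preg_*
functions). The separator set and the squash() name are my own assumptions,
not anything from a real filter:

```python
import re

# Characters people slip between letters to dodge a filter: whitespace
# (which covers line breaks), dashes, asterisks, plus signs, dots and
# underscores. The exact set is an assumption; extend it as needed.
SEPARATORS = re.compile(r"[\s\-*+._]+")

def squash(text):
    """Lower-case the text and delete every run of separator characters."""
    return SEPARATORS.sub("", text.lower())

print(squash("f s c k"))     # fsck
print(squash("F-s*c+k"))     # fsck
print(squash("f\ns\nc\nk"))  # fsck
```

Bear in mind that the squashed text has no word boundaries left, so anything
you match against it is a plain substring match, with all the false
positives that implies.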

> Problem 3: Scunthorpe contains an unfortunate series of letters (amongst
> the town's many disadvantages) however the critical four are not a word in
> and of their own right so employ whitespace (\s) in the RegEx or token
> analysis.

That's a good solution, but it's obviously being missed by many developers
of this sort of filter... see the couple of followups I made immediately
after my original response.
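
A minimal sketch of the idea in Python, using \b (word boundary) rather than
a literal \s so that punctuation and the start/end of the string count as
boundaries too. The word list is made up, with "hell" standing in for
something ruder:

```python
import re

# Hypothetical word list; "hell" is a stand-in for a genuinely rude word.
BLOCKED = ["fsck", "hell"]

# \b anchors each alternative at word boundaries, so "hell" buried inside
# "shell" or "hello" is not a hit.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKED) + r")\b",
    re.IGNORECASE,
)

def flagged_words(text):
    """Return the blocked words found as whole words, in order of appearance."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]

print(flagged_words("Open a shell and say hello"))    # []
print(flagged_words("What the hell, just run fsck"))  # ['hell', 'fsck']
```

Note that the innocent "run fsck" still gets caught: word boundaries fix the
Scunthorpe case, not the context problem.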

> > May I suggest, rather than picking your way through this minefield, you
> > provide a "report abusive comment" link instead?
> 
> However some countries are now legislating responsibility that 
> ISPs/employers must discharge 

Whoops, forgot about that... 

> In this case perhaps one could analyse the incoming text and place an
> embargo on its publication on the web site until it has been reviewed by a
> human editor?

Looks like the best solution possible.
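
Roughly, I'd picture it like this (Python sketch; the word list, the Comment
class and the status names are all invented for illustration): anything the
filter trips on is held as "pending" for a human, everything else goes
straight out.

```python
import re
from dataclasses import dataclass

# Hypothetical word list, matched on word boundaries only.
PATTERN = re.compile(r"\b(?:fsck|hell)\b", re.IGNORECASE)

@dataclass
class Comment:
    text: str
    status: str = "published"

def submit(text, review_queue):
    """Publish clean comments; hold flagged ones for a human editor."""
    comment = Comment(text)
    if PATTERN.search(text):
        comment.status = "pending"      # embargoed until reviewed
        review_queue.append(comment)
    return comment

queue = []
print(submit("Try running fsck on the disk", queue).status)  # pending
print(submit("Thanks, that fixed it", queue).status)         # published
print(len(queue))                                            # 1
```

The filter only decides what a person has to look at, so a false positive
costs a delay rather than a lost comment.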

If the OP is interested I will see if I can get our content filter word list
from the network manager here... no promises though.

Cheers
Jon

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
