2009/5/29 Merlin Morgenstern <merli...@fastmail.fm>:
> Per Jessen wrote:
>> Merlin Morgenstern wrote:
>>> Hi there,
>>> I am matching text against an array of keywords to detect spam.
>>> Unfortunately there are some false positives due to the fact that
>>> stripos also finds the keyword inside a word.
>>> E.G. "Bewerbung" -> "Werbung"
>>> First thought: use strpos, but this does not help in all cases
>>> Second thought: split text into words and use in_array, but this does
>>> not find things like "zu Hause" or "flexible/Arbeit"
>> First thought - use Spamassassin.
>> Second thought - use regexes.
> Sorry, this is a different scenario. I do need to do it this way in my case.
> It is about spam inside user postings.
> Any ideas?
I've had to solve this problem before, and the conclusion I came to is
that with this kind of simple matching you have to accept either false
positives or false negatives. The alternatives are to implement
Bayesian filtering or some other algorithm more complex than simple
matching, or to use a pre-existing solution.
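
If you stay with simple matching, a regex with letter boundaries at least
avoids the "Bewerbung" case while still catching phrases like "zu Hause" and
"flexible/Arbeit". This is only a rough sketch: the function name
findSpamKeywords is made up for illustration, it assumes the text and the
keywords are valid UTF-8 (with the /u modifier preg_match returns false
otherwise), and it will still miss obfuscated spellings, so the trade-off
above still applies.

```php
<?php
// Hypothetical helper (not from the thread): returns the keywords that occur
// in $text as whole words or phrases rather than inside a longer word.
// Assumption: $text and all keywords are valid UTF-8.
function findSpamKeywords(string $text, array $keywords): array
{
    $hits = [];
    foreach ($keywords as $keyword) {
        // (?<![\pL\pN]) and (?![\pL\pN]) require that no letter or digit
        // sits directly before or after the keyword.
        // preg_quote() escapes regex metacharacters and the "/" delimiter.
        $pattern = '/(?<![\pL\pN])' . preg_quote($keyword, '/') . '(?![\pL\pN])/iu';
        if (preg_match($pattern, $text) === 1) {
            $hits[] = $keyword;
        }
    }
    return $hits;
}
```

Called as findSpamKeywords($posting, array('Werbung', 'zu Hause')), it returns
the subset of keywords that matched, so an empty array means nothing was found.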
I'm sure you could integrate SpamAssassin or similar, because at the
end of the day all those systems expect is a bunch of text. If they
require the headers of an email, you can supply fake ones and remove
any effect the headers have on the score. Whether that's worth it depends
on the volume you're talking about and how much manual moderation
you are willing to do.
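
To make the "fake headers" idea concrete, here is a minimal sketch of wrapping
a posting so that a mail-oriented filter can read it as a message. The function
name wrapPostingAsEmail, the example.invalid addresses and the choice of
headers are all assumptions for illustration. Handing the result to the filter
and neutralising the header-based rules are not shown.

```php
<?php
// Hypothetical helper (not from the thread): wraps a user posting in minimal
// placeholder headers so it looks like an email message.
// Assumption: the posting is plain UTF-8 text.
function wrapPostingAsEmail(string $body): string
{
    $headers = array(
        'From: posting@example.invalid',     // placeholder address
        'To: moderation@example.invalid',    // placeholder address
        'Subject: user posting',
        'Date: ' . date(DATE_RFC2822),
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    );
    // Normalise every line ending in the posting to CRLF.
    $body = preg_replace('/\r\n|\r|\n/', "\r\n", $body);
    // An empty line separates the header block from the body.
    return implode("\r\n", $headers) . "\r\n\r\n" . $body;
}
```

Keeping the headers identical for every posting means they add the same amount
to every score, which makes it easier to discount them afterwards.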
PHP General Mailing List (http://www.php.net/)