On 2/2/2010 10:33 AM, Маллиндайн Стивен (Steve Mallindine) wrote:
> There's an option (in v2 at least) to remove duplicate entries from
> the spam/not spam folders.
>
> But if memory serves, duplicate (identical) messages won't harm the
> Bayesian corpus... It's looking for word patterns... So if the same
> pattern appears in identical message bodies, but from different
> senders, why should that matter?
>
> Steve
>

Won't the bad words become weighted higher because they will be more 
frequent? Plus, eventually the identical messages are going to 
overwrite the other ones, removing their bad words completely. I thought 
that was the idea behind having bomb tests: to prevent tons of identical 
spam from corrupting the corpus. But I'm not arguing that anything should 
change. The likelihood of this happening is probably very small, and 
I'd much rather have DNSBL run first if it saves having to 
download the entire email, and thus saves resources.
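
To put rough numbers on both worries, here is a minimal sketch. It is not 
ASSP's actual scoring code. It assumes a simplified Graham-style per-token 
probability and a hypothetical corpus folder capped at 100 files where the 
newest messages displace the oldest. All counts and names are made up for 
illustration.

```python
from collections import deque


def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Simplified per-token spam probability (an assumption, not ASSP's formula)."""
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical corpus: the token appears in 10 of 100 spam and 5 of 100 ham messages.
before = spam_probability(10, 100, 5, 100)

# 50 identical spam messages containing the token are added to the spam folder.
after = spam_probability(10 + 50, 100 + 50, 5, 100)

# Hypothetical capped spam folder: only the newest 100 messages are kept.
corpus = deque((f"spam-{i}" for i in range(100)), maxlen=100)
corpus.extend(["duplicate"] * 100)  # a flood of 100 identical messages arrives
distinct_left = len(set(corpus))

print(f"token weight before: {before:.3f}, after: {after:.3f}")
print(f"distinct messages left in capped corpus: {distinct_left}")
```

Under these assumptions the token's weight rises once the duplicates are 
counted, and a large enough flood leaves the capped folder holding nothing 
but copies of one message. Whether either effect matters in practice depends 
on how ASSP really counts tokens and prunes the corpus.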


_______________________________________________
Assp-test mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/assp-test
