On Saturday, January 8, 2011, 11:55:27 PM, Warren Jr. wrote:
> On Sat, Jan 8, 2011 at 8:45 PM, Jeff Chan <[email protected]> wrote:
>> On Saturday, January 8, 2011, 10:32:32 PM, Jeff Chan wrote:
>>
>>> Old corpora may result in incorrect scores being applied current
>>> messages.
>>
>> Er, "applied to"....
> I'm finding many cases where even my 2009 corpora no longer is
> representative of modern mail. I want to remove 2007 and 2008 from
> the masscheck, but we need more recent ham to make up for it...

Thanks Warren,

I think a useful research project would be to see how long ham and
spam corpora should be kept or used. Both will have relevant
lifetimes. If the samples have dates, it ought to be possible to run
an experiment programmatically and see how age affects usefulness.
For example, group samples by age and see how they score against
current rules and data sources. Then choose age cutoffs based on the
results.

It should be possible to determine the cutoffs automatically and
objectively based on performance, just like rules are scored
automatically. Testing samples by age should be a much simpler
problem though.

Cheers,

Jeff C.
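
A rough sketch of the age-grouping experiment described above, in Python. It is only an illustration under stated assumptions, not an existing masscheck tool: the `Sample` record, the `error_rate_by_age` function, and the idea that each sample already carries a score computed under the current rules are all hypothetical. The 5.0 threshold is likewise just an assumed spam cutoff.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    """Hypothetical record for one corpus message."""
    received: date   # date taken from the message
    is_spam: bool    # hand-classified label
    score: float     # score under the *current* rules (assumed precomputed)


def error_rate_by_age(samples, today, threshold=5.0):
    """Group samples by age in whole years and measure misclassification.

    Returns {age_in_years: (ham_fp_rate, spam_fn_rate)}. A rate is None
    when the bucket holds no samples of that kind.
    """
    counts = defaultdict(lambda: {"ham": 0, "fp": 0, "spam": 0, "fn": 0})
    for s in samples:
        age = (today - s.received).days // 365   # approximate age bucket
        c = counts[age]
        flagged = s.score >= threshold           # would be called spam today
        if s.is_spam:
            c["spam"] += 1
            c["fn"] += not flagged               # spam that slips through
        else:
            c["ham"] += 1
            c["fp"] += flagged                   # ham wrongly flagged
    result = {}
    for age, c in sorted(counts.items()):
        fp = c["fp"] / c["ham"] if c["ham"] else None
        fn = c["fn"] / c["spam"] if c["spam"] else None
        result[age] = (fp, fn)
    return result
```

With per-bucket rates in hand, a cutoff could be chosen as the first age at which either rate rises noticeably above the rate for the newest bucket. How much of a rise counts as "noticeable" is a judgment the experiment itself would have to inform.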
