On 13-04-17 10:10 AM, Avik Pal wrote:
Don't lose hope, Terri. After digging for a couple of hours I came across this, and it's pretty much up to date: http://untroubled.org/spam/
Finding sources of spam (like that one) isn't that hard. The challenge is finding legit email and spam that were collected, classified, and processed in the same way. As I said, you can combine a spam source like this with a publicly available mailing list to make a synthetic set, but scientifically speaking, that isn't really a preferred way to handle data, because the two halves come from different sources.
The problem is that with multiple sources it sometimes becomes too easy for a classifier to latch onto features that separate the sources but won't be useful in the future. For example, it might classify on the fact that the list address never appears in the To: or Cc: lines of the spam data (because that data comes from a different source), on the fact that many of the spams are from different time periods, or on the fact that the spam data is anonymized differently from any list data you might have. You will wind up doing a lot of work to normalize the data sets so that classifiers can't rely on these artifacts (and we're talking potentially weeks of really boring work here, which you need to start Right Now if you're going to use such a set). You also run the risk of losing features that would have been useful in a single-source set but have been completely obliterated in the synthetic one.
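To give a rough idea of what that normalization looks like, here is a minimal sketch that strips headers likely to reveal which source a message came from. The header list, the function name, and the two sample messages are all made up for illustration, and a real pass would also have to deal with anonymization differences and time-period effects in the bodies, which this does not attempt.

```python
from email import message_from_string
from email.message import Message

# Hypothetical choice of headers that tend to identify the source of a
# message rather than its content; a real list would need tuning.
SOURCE_LEAKING_HEADERS = ("To", "Cc", "Date", "Received", "List-Id")


def strip_source_headers(raw: str) -> Message:
    """Parse a raw message and drop headers that could leak its source."""
    msg = message_from_string(raw)
    for name in SOURCE_LEAKING_HEADERS:
        # Deleting removes every occurrence and is a no-op if absent.
        del msg[name]
    return msg


# Invented sample messages standing in for the two sources.
list_mail = (
    "To: some-list@example.org\n"
    "Date: Tue, 02 Apr 2013 09:00:00 +0000\n"
    "Subject: Question about archiving\n"
    "\n"
    "Hello list\n"
)
spam_mail = (
    "To: victim@example.com\n"
    "Date: Mon, 01 Jan 2007 00:00:00 +0000\n"
    "Subject: Cheap pills\n"
    "\n"
    "Buy now\n"
)

cleaned = [strip_source_headers(m) for m in (list_mail, spam_mail)]
header_sets = [sorted(m.keys()) for m in cleaned]
```

After this pass both messages carry only a Subject header, so a classifier can no longer separate them on the recipient address or the date. Note that this is also exactly how useful features get thrown away: in a single-source set, To: and Date: might have been perfectly good signals.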
Terri