Re: [Mailman-Developers] GSOC 2013 project discussion

Terri Oda Wed, 17 Apr 2013 09:35:48 -0700


On 13-04-17 10:10 AM, Avik Pal wrote:

Don't lose hope Terri, after digging for a couple of hours came across this and it's pretty much updated. http://untroubled.org/spam/

Finding sources of spam (like that one) isn't that hard; it's finding sources of legit email combined with spam and classified and processed in the same way that's challenging. As I said, you can combine a spam source like this with a publicly available mailing list to make a synthetic set, but scientifically speaking, those aren't really preferred ways to handle data because they come from multiple sources.

The problem is that when you have multiple sources it sometimes becomes too easy for a classifier to classify on less-than-useful features for future use. For example, one might classify on the fact that the list address won't appear in any of the To: or Cc: lines in the spam data because it comes from a different source, the fact that many of the spams will be from different time periods, the fact that the spam data is anonymized differently from any list data you might have, etc. You will wind up doing a lot of work to normalize the data sets to avoid these classifiers (and we're talking weeks of really boring work here, potentially, that you need to start Right Now if you're going to be using such a set), and you run the risk of missing out on features that would have been useful in a single-source set that have been completely obliterated by the synthetic data set.
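To make the To:/Cc: example concrete, here is a rough sketch of how one might check whether a single source-revealing feature already separates a synthetic set, and what stripping it looks like. Everything in it is an assumption for illustration: the list address, the four hand-written messages, and the helper names (mentions_list, leak_accuracy, strip_headers) are hypothetical, and a real corpus would need far more normalization than this.

```python
from email import message_from_string

LIST_ADDRESS = "list@example.org"  # hypothetical list address

# Tiny hand-written stand-ins for two separate sources (hypothetical data).
HAM_SOURCE = [
    "From: alice@example.com\nTo: list@example.org\n"
    "Subject: patch review\n\nLooks good to me.",
    "From: bob@example.com\nTo: carol@example.com\nCc: list@example.org\n"
    "Subject: Re: bug\n\nFixed in trunk.",
]
SPAM_SOURCE = [
    "From: promo@example.net\nTo: victim@example.net\n"
    "Subject: cheap pills\n\nBuy now.",
    "From: win@example.net\nTo: someone@example.net\n"
    "Subject: you won\n\nClaim your prize.",
]


def mentions_list(raw, list_address=LIST_ADDRESS):
    """Return True if the list address appears in a To: or Cc: header."""
    msg = message_from_string(raw)
    recipients = " ".join(msg.get_all("To", []) + msg.get_all("Cc", []))
    return list_address.lower() in recipients.lower()


def leak_accuracy(ham, spam, feature):
    """Accuracy of the rule 'feature present means ham, absent means spam'."""
    correct = sum(feature(m) for m in ham) + sum(not feature(m) for m in spam)
    return correct / (len(ham) + len(spam))


def strip_headers(raw, names=("To", "Cc", "Date")):
    """Drop headers that mostly reveal which source a message came from."""
    msg = message_from_string(raw)
    for name in names:
        del msg[name]  # no error if the header is absent
    return msg.as_string()


# One trivial feature classifies the combined set perfectly: a red flag.
leak_before = leak_accuracy(HAM_SOURCE, SPAM_SOURCE, mentions_list)

# After normalization the same feature is no better than chance.
ham_clean = [strip_headers(m) for m in HAM_SOURCE]
spam_clean = [strip_headers(m) for m in SPAM_SOURCE]
leak_after = leak_accuracy(ham_clean, spam_clean, mentions_list)

print(leak_before, leak_after)  # prints: 1.0 0.5
```

Note the cost: stripping To:, Cc: and Date: removes the leak, but it also throws away recipient and timing information that could have been a legitimate signal in a single-source set.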


 Terri
