Re: [Mailman-Developers] GSOC 2013 project discussion

Terri Oda Wed, 17 Apr 2013 11:29:14 -0700

I'm glad you're somewhat aware of the issues. I frequently encounter folk who aren't aware of the issues in machine learning, so your "don't lose hope" email set off all kinds of warning bells in my head.


Going back to GSoC-specific stuff:


- Enron is a very old data set

- If you're going to use it, you need to be prepared to defend that choice. I'm not sure it's a choice that can be defended at all, knowing the field. It's probably not only an old data set, but a completely counter-productive one given the space in which Mailman operates.


So here's some things to think about:

(1) I want some justification of how this is going to be relevant to the problem you're trying to solve, which is "helping classify spam emails sent to a mailing list that the MTA was unable to classify"

(2) Many existing classifiers that run at the MTA level have already used the Enron data set, so chances are any features you learn will already have been incorporated. I have severe concerns that any new features you learn will result in over-fitting. How can you believe that yet another classifier trained on the same data will be worth the processing overhead and resulting delays in mail delivery when it seems likely that any improvement will be incremental at best?

(3) Enron is not going to help you make use of any list-specific features. How can you use this data set to produce something that is useful to Mailman, going beyond what any MTA-level spam filter can do? (Note that we've been telling people to do spam filtering at the MTA level for years and years and years; justifying this is not going to be an easy task)

(4) If you're going to do cross-validation with other data to make claims that the final classifier will be relevant to list data, how is that data going to be obtained, processed, and used?

(5) Unless you've got a plan for making extensive use of the fact that you're classifying mailing list data and not general email, you're pretty much wasting our time since we are only interested in projects relevant to Mailman.

To be completely honest, I'm still seeing "student project for data mining class" level thinking here, and that's not going to be good enough for us. Especially considering that you didn't even know about the most common data sets for this problem, I'm concerned that you haven't yet reached the skill and experience necessary for us to seriously consider a classifier as even a small part of a GSoC project. We have to give priority to students who we are convinced can finish their projects, and it seems like there's too many chances of you getting stuck on finding data and using it correctly on a problem that is actually meaningful to Mailman and not just a general classification task.


 Terri


On 13-04-17 10:51 AM, Avik Pal wrote:

ya I get your point, but see these are part of any machine learning project, and feature extraction has to be done considering the synthetic data set.
On 17 April 2013 22:05, Terri Oda <[email protected]> wrote:
    Finding sources of spam (like that one) isn't that hard; it's
    finding sources of legit email combined with spam and classified
    and processed in the same way that's challenging.  As I said, you
    can combine a spam source like this with a publicly available
    mailing list to make a synthetic set, but scientifically speaking,
    those aren't really preferred ways to handle data because they
    come from multiple sources.
well in this regard the only thing I can do is keep looking. I am also aware that coming from different sources can make them skewed, but again these things are never perfect and there is always scope for betterment. I think that our aim should be to implement a rudimentary classifier with fairly good performance to start with.

