Avik Pal writes:

 > Meanwhile, it would be much appreciated if someone could direct me to
 > a labeled dataset available online.

By "labelled" you mean pre-classified into spam vs ham?  I see you
already found one, but you could also check the SpamBayes and
SpamAssassin distributions.

 > Here I have a suggestion: after submitting, whenever an email is
 > classified as Spam, we store it in a separate archive and, at the
 > end of the day, send subscribers a mail saying "this is the digest of
 > all the mails that Mailman thinks are Spam". The subscriber may go
 > there, view them, and also mark them as not Spam,

I suggest that you present this as an option for users who want to
tune the filters, and as something that can be used pre-release to
develop the initial parameters for the distributed classifier.
Although Bayesian classifiers do offer the option to train or tune
your personal classifier on a local corpus, most users just stick with
the distribution parameters plus self-training.  It's pretty effective
(surprisingly so to me).  I guess the logic is that spammers aren't
terribly creative.
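
To make "distribution parameters plus self-training" concrete, here is a
minimal sketch of the idea in Python.  It is not the SpamBayes or
SpamAssassin algorithm; the class name TinyBayes, the tokenizer, and the
0.9/0.1 cutoffs are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive-Bayes-style spam/ham scorer (illustrative only)."""

    def __init__(self):
        # Number of training messages containing each token, per label.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of training messages seen, per label.
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Hypothetical tokenizer: the set of lowercase word-like tokens.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        self.docs[label] += 1
        self.counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        # Sum smoothed log-odds over the tokens present in the message.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on extreme scores.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify_and_self_train(self, text, spam_cutoff=0.9, ham_cutoff=0.1):
        # Self-training: confident verdicts are fed back as training data;
        # anything in between is left for a human to label.
        p = self.spam_probability(text)
        if p >= spam_cutoff:
            self.train(text, "spam")
            return "spam"
        if p <= ham_cutoff:
            self.train(text, "ham")
            return "ham"
        return "unsure"
```

In this picture the "distribution parameters" are the counts shipped with
the release, and messages that land in "unsure" are the ones worth
showing to users who opt in to tuning.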

 > Emails which stay as Spam will be dropped after a month

Let's think carefully about that.  Everybody deletes the spam; that's
why you started by asking for a labelled dataset, because nobody keeps
one around.  Somebody really ought to do the public service of
collecting a corpus.  Of course, if you do arrange to keep it around,
it's going to need to be an option that sites and list owners can
disable.
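
As a rough sketch of what such an option might look like, assuming a
hypothetical HeldSpam record and a retention_days setting (neither is an
existing Mailman API), where None stands for a site or list owner having
disabled retention:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class HeldSpam:
    """Hypothetical record for a message the classifier marked as spam."""
    msgid: str
    received: datetime


def split_expired(
    held: List[HeldSpam],
    now: datetime,
    retention_days: Optional[int] = 30,
) -> Tuple[List[HeldSpam], List[HeldSpam]]:
    """Split held spam into (keep, drop).

    retention_days=None models retention being disabled: nothing is
    kept and every held message is dropped.
    """
    if retention_days is None:
        return [], list(held)
    cutoff = now - timedelta(days=retention_days)
    keep = [m for m in held if m.received >= cutoff]
    drop = [m for m in held if m.received < cutoff]
    return keep, drop
```

The 30-day default only mirrors the "after a month" in the proposal; the
right value, and whether the default should be on or off, is exactly
what needs careful thought.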

_______________________________________________
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://wiki.list.org/x/AgA3
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: http://wiki.list.org/x/QIA9
