Hello Kevin,

Looking forward to hearing from you soon! Is there any other mode of communication that we can use? What do you suggest? I guess Skype or Google Hangouts should solve the issue.

Cheers,
Sarang

On Mon, Apr 6, 2015 at 3:07 PM, Sarang Shrivastava <[email protected]> wrote:

> Hello Kevin,
>
> Sorry for the late reply; I was out of town. Here are my answers to your queries (I have tried to answer them, but I am not sure they are all correct):
>
> At present SA uses a Bayesian classifier together with some additional DNS filters to check for spam.
> Firstly, SA does not currently use any of the neural net models that I have listed in my proposal. Secondly, SA assigns a very high probability to new words that are not present in the Bayes database. But there is a chance that a meaningful message arrives together with some garbage.
>
> Regarding the plugins, I think the way to go is multiple plugins within a module named statistical classifier. This module will contain all the plugins for the models that I have listed. By default, the model that gives the best result out of the listed models will be on. But the user will have the flexibility to choose from a range of statistical plugins, so that if in the future any additional method, together with the present methods, gives a better result, that plugin can be switched on.
>
> The neural nets that I have listed in my proposal seem to be better on paper for the following reasons:
>
> 1) NB is not a very scalable algorithm for a large number of emails where most of the words are random nouns.
>
> 2) NB is a good approach to classifying emails, but it does not do well at all against good unsupervised learning, because the size of an email doesn't really determine its importance; even small emails can be spam-free.
>
> For example,
>
> a) we are considering you for a job ..
>
> b) and an urgent job is posted for you ..
>
> Now sentence (a) might be from a possible employer to whom you have applied, whereas sentence (b) is 90% likely to be from those spammers who send random job requirements to people.
> The second one falls more into the category of spam. So if the user classifies it as spam, the weightage of the combination "urgent job" will be more than that of the rest of the features. But in Bayesian filtering each and every feature is considered independent. Grammar and language features are not included in Bayesian filtering, but with neural nets a lot of these things can come into play.
>
> A thought which you might like to consider:
>
> I didn't list SVM in my proposal because, while I was attached with Rspamd, there were two separate ideas: one was for implementing supervised neural nets, the other was for unsupervised SVM. The SA project has neither of them, so I was thinking that considering SVM as well is not a bad option, because in cases where the frequency of words matters SVM gives the best result.
>
> On Sun, Apr 5, 2015 at 2:16 AM, Kevin A. McGrail <[email protected]> wrote:
>
>> Thanks Sarang. I got your email to my address as well but it's a holiday weekend for me in the states. (Happy Easter to all those who celebrate!)
>>
>> It looks to me like you understand that programming is a state of mind, not a language, which is good, and you are capable of switching gears.
>>
>> I will sign up as a mentor on Monday and let you know when that is done.
>>
>> From there, you can look at the SA code and answer the basic questions below because to me your proposal needs clarification to switch to this project. There is a lot of information in the proposal I don't grok so I think the basic high level questions for you are:
>>
>> - What does SA have now related to your proposal?
>> - What do you propose? A plugin? Multiple plugins?
>> - Why is this anticipated to be better than what exists now?
>>
>> Regards,
>> KAM
>>
>> On 4/4/2015 4:17 PM, Sarang Shrivastava wrote:
>>
>> ---------- Forwarded message ----------
>> From: Sarang Shrivastava <[email protected]>
>> Date: Sat, Apr 4, 2015 at 11:15 AM
>> Subject: Re: LOOKING OUT FOR A MENTOR FOR GSOC 2015
>> To: "Kevin A. McGrail" <[email protected]>
>>
>> Hi Kevin,
>>
>> Before I came in contact with Rspamd I didn't know Lua at all, but within a week I was proficient enough to at least understand the parts of the Rspamd source code written in Lua. As you know, necessity is the mother of invention, so learning Perl and Redis should not be a hurdle.
>>
>> I was just worried that first of all I needed to find a mentor. Now that I have one (hopefully, as you seem to be interested), starting today I will dig more into the source code of SA and brush up on my Perl and Redis skills.
>>
>> Regarding the dataset, what I plan is:
>>
>> Firstly, I could directly use the well-known Enron dataset for spam filters:
>> http://www.aueb.gr/users/ion/data/enron-spam/
>>
>> Secondly, I could take the spam dataset from:
>> http://untroubled.org/spam/
>> which has a collection of spam from 1998-2011, and take the ham dataset from my own mail account by importing my mails, or for that matter anyone's, from the Gmail server:
>> https://www.mattcutts.com/blog/backup-gmail-in-linux-with-getmail/
>>
>> I'll set up my development environment today. I didn't get one of your questions: "Additionally, what resources do you have to develop and test this code on?". Did you mean where I would test my code? Initially I would just work on the test data and take input directly from the dataset in the Perl script that I would be writing. Or, if SA has a testing framework, I could use that to test my script.
>> Or, if I need to write the unit tests myself, that could be done, but it would be better if there is some framework that I could use.
>>
>> Just a thought:
>> While going through the SA source code I came across a script whose comments say "This is the general class used to train a learning classifier with new samples of spam and ham mail, and classify based on prior training."
>> But I guess this is primarily for Bayesian filtering.
>> If this is the case, I can design a similar script for my testing purposes.
>>
>> One more thing: once I am done with the coding part, I can switch off the other rules that SA uses to filter spam and switch on only the filter for my code. This would let me check that everything is working fine, and then I would have to focus just on improving the performance of the filtering process.
>>
>> So what I plan for the upcoming week is to take a deeper look into the SA source code (the part where Bayesian filtering is implemented) while learning Perl and Redis side by side.
>>
>> What else do you want me to do? Your suggestions are most welcome and would help me to have a better understanding of the SA project and how to get things done.
>>
>> Cheers,
>> Sarang
>>
>> On Fri, Apr 3, 2015 at 11:47 PM, Kevin A. McGrail <[email protected]> wrote:
>>
>>> Hi Sarang,
>>>
>>> I've mentored in past GSOCs so I'm interested in helping you but I am concerned about your proposal and the SpamAssassin project. So I can't sign off on it as-is but I'd like to see if we can fix that.
>>>
>>> The SA project is built on plugins primarily in perl. I didn't see perl or Redis in your proficiencies which I have no doubt you can learn but I'd like to know more about your plans with that.
>>>
>>> You also mentioned a data set and I'm not sure what data set you plan to use for testing.
>>> Additionally, what resources do you have to develop and test this code on? These may be simple or difficult hurdles but they merit attention.
>>>
>>> Just replacing spamassassin where rspamd exists doesn't really mean the Project Proposal is ready to go because of things like the plugin language (not lua), etc.
>>>
>>> Can you look at SA and delve a bit more into the end goal with your proposal for SA? I understand completely if this isn't a fit so don't hesitate to bow out.
>>>
>>> regards,
>>> KAM
>>>
>>> On 4/3/2015 1:06 PM, Sarang Shrivastava wrote:
>>>
>>> Hello all,
>>>
>>> I am Sarang Shrivastava, an open source enthusiast from MNNIT, Allahabad, India.
>>>
>>> While applying for this year's GSOC I committed a blunder. In the initial phase I was interested in working with the Rspamd organisation (basically a spam filter) and was working on the idea of "IMPLEMENTING META-STATISTIC ALGORITHMS".
>>> But while submitting the proposal I accidentally submitted it to the Apache Software Foundation.
>>>
>>> I asked the mentors of both Rspamd and Apache to somehow transfer my proposal to Rspamd, but this can't happen now.
>>>
>>> The thing is, my proposal is not organisation specific. Any open source spam filtering project that does not already have this idea can take advantage of it. I went through the SpamAssassin wiki page and found that it only has Bayesian filtering as a statistical classification technique, but the other machine learning methods that I have listed in my proposal could surprisingly increase the efficiency of the spam filtering process.
>>>
>>> So I would really appreciate it if anyone could mentor me throughout the GSOC period. I want to work on this proposal, but unless and until one of you signs up as a mentor and accepts my proposal in Melange before the 12th of April, I cannot work on it further.
>>>
>>> If anyone among you is interested in my idea, I kindly request you to be my mentor. I am sure that, given a chance to prove myself, I would not disappoint you.
>>>
>>> The link to my proposal is:
>>> https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2015/xlr_24/5629499534213120
>>>
>>> I have also enclosed a copy of my proposal as an attachment.
>>> PS: In my attached proposal, wherever I wrote Rspamd, I have replaced it with SpamAssassin.
>>>
>>> Cheers,
>>> Sarang
>>>
>>> --
>>> *Sarang Shrivastava*
>>> *Computer Science & Engineering*
>>> *MNNIT Allahabad*
>>
>> --
>> *Sarang Shrivastava*
>> *Computer Science & Engineering*
>> *MNNIT Allahabad*
>>
>> --
>> *Kevin A. McGrail*
>> President
>>
>> Peregrine Computer Consultants Corporation
>> 3927 Old Lee Highway, Suite 102-C
>> Fairfax, VA 22030-2422
>>
>> http://www.pccc.com/
>>
>> 703-359-9700 x50 / 800-823-8402 (Toll-Free)
>> 703-798-0171 (wireless)
>> [email protected] <[email protected]>
>
> --
> *Sarang Shrivastava*
> *Computer Science & Engineering*
> *MNNIT Allahabad*

--
*Sarang Shrivastava*
*Computer Science & Engineering*
*MNNIT Allahabad*
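
The "urgent job" point from the Apr 6 message can be made concrete with a small sketch. It is written in Python rather than Perl, it is not SpamAssassin's Bayes code, and the training messages, class name, and function names are all made up for illustration. It only shows the general idea: a naive Bayes score is a sum of independent per-feature terms, so word order is invisible to it unless word pairs are added as features.

```python
import math
from collections import Counter


def unigrams(text):
    """Single-word features only."""
    return text.lower().split()


def unigrams_and_bigrams(text):
    """Single words plus adjacent word pairs such as 'urgent job'."""
    words = text.lower().split()
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one smoothing (illustration only)."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = Counter()

    def train(self, text, label):
        self.counts[label].update(self.featurize(text))
        self.docs[label] += 1

    def spam_log_odds(self, text):
        """Log-odds of spam versus ham; positive means more spam-like."""
        vocab_size = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = {label: sum(c.values()) for label, c in self.counts.items()}
        score = math.log(self.docs["spam"] / self.docs["ham"])
        # Independence assumption: every feature adds its own term to the sum.
        for feat in self.featurize(text):
            p_spam = (self.counts["spam"][feat] + 1) / (total["spam"] + vocab_size)
            p_ham = (self.counts["ham"][feat] + 1) / (total["ham"] + vocab_size)
            score += math.log(p_spam / p_ham)
        return score


# Hypothetical training messages, invented for this example.
TRAINING = [
    ("urgent job posted for you", "spam"),
    ("an urgent job is waiting apply now", "spam"),
    ("we are considering you for a job", "ham"),
    ("the job is not urgent take your time", "ham"),
]

uni = NaiveBayes(unigrams)
bi = NaiveBayes(unigrams_and_bigrams)
for model in (uni, bi):
    for text, label in TRAINING:
        model.train(text, label)

for name, model in (("unigrams", uni), ("unigrams + bigrams", bi)):
    print(name, model.spam_log_odds("urgent job"), model.spam_log_odds("job urgent"))
```

With word features alone, "urgent job" and "job urgent" get the same score, because each word contributes independently. Once the pair is a feature, "urgent job" scores higher, since in this toy data the pair appears only in the spam examples.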
