Re: Fwd: LOOKING OUT FOR A MENTOR FOR GSOC 2015

Sarang Shrivastava Tue, 07 Apr 2015 10:01:06 -0700

Hello Kevin,

Looking forward to hearing from you soon!
Is there any other mode of communication that we can use? What do you
suggest? I guess Skype or Google Hangouts would solve the issue.


Cheers,
Sarang

On Mon, Apr 6, 2015 at 3:07 PM, Sarang Shrivastava <[email protected]>
wrote:

> Hello Kevin,
>
> Sorry for the late reply; I was out of town. Here are my answers to
> your queries (I have tried to answer them, but I am not sure they are
> entirely correct):
>
> At present, SA uses a Bayesian classifier together with some additional DNS
> filters to check for spam.
> Firstly, SA does not currently use any of the neural net models that I
> have listed in my proposal. Secondly, SA assigns a very high probability
> to new words that are not present in the Bayes database. But there is a
> chance that a meaningful message arrives together with some garbage
> words.
>
> Regarding the plugins, I am thinking of going with multiple plugins within
> a module named "statistical classifier". This module would contain all
> the plugins for the models that I have listed. By default, the
> model that gives the best result among the listed models would be on. But
> the user would be given the flexibility to choose from a range of
> statistical plugins, so that if in the future any additional
> method, together with the present methods, gives a better result, that
> plugin can be switched on.
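
A minimal sketch of the plugin selection described above, assuming each model comes with a validation score. It is written in Python purely for illustration (SA plugins are written in Perl), and every name in it (`StatisticalClassifierModule`, `register`, `select`, `active`) is hypothetical, not part of SA.

```python
from typing import Callable, Dict, Optional

# A classifier is modelled as any callable taking a message and returning
# True for spam. This interface is an assumption made for the sketch.
Classifier = Callable[[str], bool]


class StatisticalClassifierModule:
    """Hypothetical container for several statistical classifier plugins."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Classifier] = {}
        self._scores: Dict[str, float] = {}
        self._override: Optional[str] = None

    def register(self, name: str, classifier: Classifier, score: float) -> None:
        # 'score' is an assumed validation metric (for example accuracy).
        self._plugins[name] = classifier
        self._scores[name] = score

    def select(self, name: str) -> None:
        # Lets the user switch on a specific plugin instead of the default.
        if name not in self._plugins:
            raise KeyError(name)
        self._override = name

    def active(self) -> str:
        # Default: the registered plugin with the best score.
        if self._override is not None:
            return self._override
        return max(self._scores, key=lambda n: self._scores[n])

    def is_spam(self, message: str) -> bool:
        return self._plugins[self.active()](message)
```

With two registered plugins, `active()` returns the better-scoring one until `select()` names another.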
>
> The neural nets that I have listed in my proposal seem to be better
> on paper for the following reasons:
>
> 1) NB is not a very scalable algorithm for a large number of emails where
> most of the words are random nouns.
>
> 2) NB is a good approach to classifying emails, but it does not do well
> compared with good unsupervised learning, because the size of an email
> doesn't really determine its importance; even small emails can be
> spam-free.
>
> For example,
>
> a) we are considering you for a job ..
>
> b) and an urgent job is posted for you ..
>
> Sentence (a) might be from a possible employer to whom you
> have applied, whereas sentence (b) is 90% likely to be from spammers who
> send random job requirements to people.
> The second one falls more into the category of spam. So if the user
> classifies it as spam, then the weight of the combination "urgent job"
> will be more than that of the rest of the features. But in Bayesian
> filtering each and every feature is considered independent. Grammar
> and language features are not included in Bayesian filtering, but with
> neural nets a lot of these things can come into play.
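
The "urgent job" point can be illustrated with a toy Naive Bayes filter. This is only a sketch in Python under stated assumptions: it is not how SA's Bayes code works, the four training messages are made up, and all names (`tokens`, `NaiveBayes`, `train_toy`) are hypothetical. Adding word pairs as extra features is used here as a simple stand-in for a model that weighs combinations of words; the model still treats every feature as independent.

```python
import math
from collections import Counter


def tokens(text, use_bigrams=False):
    """Split text into lowercase words, optionally adding adjacent word pairs."""
    words = text.lower().split()
    feats = list(words)
    if use_bigrams:
        feats += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return feats


class NaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing."""

    def __init__(self, use_bigrams=False):
        self.use_bigrams = use_bigrams
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = Counter()

    def train(self, text, label):
        self.docs[label] += 1
        self.counts[label].update(tokens(text, self.use_bigrams))

    def spam_log_odds(self, text):
        """Return log P(spam | text) - log P(ham | text); positive means spam."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        n_spam = sum(self.counts["spam"].values())
        n_ham = sum(self.counts["ham"].values())
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokens(text, self.use_bigrams):
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + vocab)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + vocab)
            score += math.log(p_spam / p_ham)
        return score


def train_toy(use_bigrams):
    """Train on four made-up messages (illustrative data only)."""
    model = NaiveBayes(use_bigrams)
    model.train("an urgent job is posted for you", "spam")
    model.train("urgent job offer apply now", "spam")
    model.train("we are considering you for a job", "ham")
    model.train("urgent please review the job report", "ham")
    return model
```

On this toy data, "urgent" and "job" each appear in ham as well, so the word-only model sees weak evidence in "urgent job", while the model with word pairs gives the phrase a clearly higher spam score.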
>
> A thought which you might like to consider:
>
> I didn't list SVM in my proposal because, while I was
> involved with RSPAMD, there were two separate ideas: one was
> implementing supervised neural nets, the other was unsupervised SVM.
> The SA project has neither of them, so I think that considering SVM
> as well is not a bad option, because in cases where the frequency of
> words matters SVM gives the best result.
>
>
>
>
>
>
>
> On Sun, Apr 5, 2015 at 2:16 AM, Kevin A. McGrail <[email protected]>
> wrote:
>
>>  Thanks Sarang.  I got your email to my address as well but it's a
>> holiday weekend for me in the States. (Happy Easter to all those who
>> celebrate!)
>>
>> It looks to me like you understand that programming is a state of mind,
>> not a language, which is good, and you are capable of switching gears.
>>
>> I will sign up as a mentor on Monday and let you know when that is done.
>>
>> From there, you can look at the SA code and answer the basic questions
>> below because to me your proposal needs clarification to switch to this
>> project.  There is a lot of information in the proposal I don't grok so I
>> think the basic high level questions for you are:
>>
>> - What does SA have now related to your proposal?
>> - What do you propose?  A plugin?  Multiple plugins?
>> - Why is this anticipated to be better than what exists now?
>>
>> Regards,
>> KAM
>>
>>
>>
>>
>>
>> On 4/4/2015 4:17 PM, Sarang Shrivastava wrote:
>>
>>
>> ---------- Forwarded message ----------
>> From: Sarang Shrivastava <[email protected]>
>> Date: Sat, Apr 4, 2015 at 11:15 AM
>> Subject: Re: LOOKING OUT FOR A MENTOR FOR GSOC 2015
>> To: "Kevin A. McGrail" <[email protected]>
>>
>>
>>  Hi Kevin,
>>
>>  Before I came in contact with Rspamd I didn't know Lua at all, but
>> within a week I was proficient enough to at least
>> understand the parts written in Lua (in the rspamd source code). As you
>> know, necessity is the mother of invention, so learning Perl and Redis
>> would not be a hurdle.
>>
>>  I was just worried that I first needed to find a
>> mentor. Now that I have one (hopefully, as you seem to be
>> interested), starting today I will dig more into the source
>> code of SA and brush up on my Perl and Redis skills.
>>
>>  Regarding the dataset, what I plan is:
>>
>>  Firstly, I could directly use the well-known Enron dataset for spam
>> filters:
>> http://www.aueb.gr/users/ion/data/enron-spam/
>>
>>  Secondly, I could take the spam dataset from
>> http://untroubled.org/spam/
>>  which has a collection of spam from 1998-2011, and take the ham dataset
>> from my own mail account by importing my mails, or for that matter
>> anyone's mails, from the Gmail server:
>> https://www.mattcutts.com/blog/backup-gmail-in-linux-with-getmail/
>>
>>  I'll set up my development environment today. I didn't understand one
>> of your questions: "Additionally, what resources do you have to develop
>> and test this code on?" If you meant where I would test my
>> code: initially I would just work with the test data and read
>> input directly from the dataset in the Perl script that I will be writing.
>> Or, if SA has a testing framework, I could use that to test my script.
>>  If I need to write the unit tests myself, that can be done, but it
>> would be better if there is some framework that I could use.
>>
>>  Just a thought:
>> While going through the SA source code I came across a script whose
>> comments say, "This is the general class used to train a learning
>> classifier with new samples of spam and ham mail, and classify based on
>> prior training."
>>  But I guess this is primarily for Bayesian filtering.
>>  If this is the case, I can design a similar script for my testing
>> purposes.
>>
>>  One more thing: once I am done with the coding part, I can just switch
>> off the other rules that SA uses to filter spam and
>> switch on only the filter for my code. This would confirm that
>> everything is working fine, and then I would only have to focus on
>> improving the performance of the filtering process.
>>
>>  So my plan for the upcoming week is to take a deeper look into the
>> SA source code (the part where Bayesian filtering is implemented) while
>> learning Perl and Redis side by side.
>>
>>  What else would you like me to do? Your suggestions are most welcome and
>> would help me to better understand the SA project and how
>> to get things done.
>>
>>  Cheers,
>>  Sarang
>>
>> On Fri, Apr 3, 2015 at 11:47 PM, Kevin A. McGrail <[email protected]>
>> wrote:
>>
>>>  Hi Sarang,
>>>
>>> I've mentored in past GSOCs so I'm interested in helping you but I am
>>> concerned about your proposal and the SpamAssassin project.  So I can't
>>> sign off on it as-is but I'd like to see if we can fix that.
>>>
>>> The SA project is built on plugins primarily in perl.  I didn't see perl
>>> or Redis in your proficiencies which I have no doubt you can learn but I'd
>>> like to know more about your plans with that.
>>>
>>> You also mentioned a data set and I'm not sure what data set you plan to
>>> use for testing.  Additionally, what resources do you have to develop and
>>> test this code on?  These may be simple or difficult hurdles but they merit
>>> attention.
>>>
>>> Just replacing spamassassin where rspamd exists doesn't really mean the
>>> Project Proposal is ready to go because of things like the plugin  language
>>> (not lua), etc.
>>>
>>> Can you look at SA and delve a bit more into the end goal with your
>>> proposal for SA?  I understand completely if this isn't a fit so don't
>>> hesitate to bow out.
>>>
>>> regards,
>>> KAM
>>>
>>>
>>> On 4/3/2015 1:06 PM, Sarang Shrivastava wrote:
>>>
>>>  Hello all,
>>>
>>>  I am Sarang Shrivastava, an open source enthusiast from MNNIT,
>>> Allahabad, India.
>>>
>>>  While applying for this year's GSOC I committed a blunder. In the
>>> initial phase I was interested in working with the RSPAMD organisation
>>> (basically a spam filter) and was working on the idea of "IMPLEMENTING
>>> META-STATISTIC ALGORITHMS".
>>> But while submitting the proposal I accidentally submitted it to the
>>> Apache Software Foundation.
>>>
>>>  I asked the mentors of both Rspamd and Apache to somehow transfer my
>>> proposal to Rspamd, but this is no longer possible.
>>>
>>>  The thing is, my proposal is not organisation-specific. Any open source
>>> spam filtering project that does not have this idea can take advantage
>>> of it. I went through the SpamAssassin wiki page and found that it
>>> only has Bayesian filtering as a statistical classification technique, but
>>> the other machine learning methods that I have listed in my proposal
>>> could considerably increase the efficiency of the spam filtering process.
>>>
>>>  So, I would really appreciate it if anyone could mentor me
>>> throughout the GSOC period. I want to work on this proposal, but unless
>>> and until one of you signs up as a mentor and accepts my proposal in
>>> Melange before the 12th of April, I cannot work on it further.
>>>
>>>  If any of you is interested in my idea, I kindly request that you
>>> be my mentor. I am sure that, given a chance to prove myself,
>>> I would not disappoint you.
>>>
>>>  The link to my proposal is :
>>> https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2015/xlr_24/5629499534213120
>>>
>>>  I have also enclosed a copy of my proposal as an attachment.
>>> PS: In my attached proposal, wherever I wrote rspamd I have replaced it
>>> with SpamAssassin.
>>>
>>>  Cheers,
>>> Sarang
>>>
>>>
>>>
>>>  --
>>> *Sarang Shrivastava*
>>> *Computer Science & Engineering*
>>> *MNNIT Allahabad*
>>>
>>>
>>>
>>>
>>
>>
>> --
>> *Sarang Shrivastava*
>> *Computer Science & Engineering*
>> *MNNIT Allahabad*
>>
>>
>>
>>  --
>> *Sarang Shrivastava*
>> *Computer Science & Engineering*
>> *MNNIT Allahabad*
>>
>>
>>
>> --
>> *Kevin A. McGrail*
>> President
>>
>> Peregrine Computer Consultants Corporation
>> 3927 Old Lee Highway, Suite 102-C
>> Fairfax, VA 22030-2422
>>
>> http://www.pccc.com/
>>
>> 703-359-9700 x50 / 800-823-8402 (Toll-Free)
>> 703-798-0171 (wireless)
>> [email protected] <[email protected]>
>>
>>
>
>
> --
> *Sarang Shrivastava*
> *Computer Science & Engineering*
> *MNNIT Allahabad*
>



-- 
*Sarang Shrivastava*
*Computer Science & Engineering*
*MNNIT Allahabad*
