Re: GSOC 2018 SpamAssassin Statistical Classifier Plugin

Kevin A. McGrail Fri, 23 Mar 2018 05:31:48 -0700

Wanted to check in and see how you are doing.  This blog post has gotten
some praise:

https://medium.com/@owtf/google-summer-of-code-writing-a-good-proposal-141b1376f076

--
Kevin A. McGrail
Asst. Treasurer & VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171

On Wed, Mar 21, 2018 at 7:52 AM, Kevin A. McGrail <[email protected]>
wrote:

> Comments allowed might be helpful though :-)
>
> --
> Kevin A. McGrail
> Asst. Treasurer & VP Fundraising, Apache Software Foundation
> Chair Emeritus Apache SpamAssassin Project
> https://www.linkedin.com/in/kmcgrail - 703.798.0171
>
> On Wed, Mar 21, 2018 at 12:36 AM, Rajkiran Rajkumar <
> [email protected]> wrote:
>
>> @Saahil, kindly make your doc view-only for people with a link to it.
>> Giving edit permissions to the world is a bad idea.
>>
>> Thanks,
>> Rajkiran
>>
>> On Tue, Mar 20, 2018 at 5:17 PM, Kevin A. McGrail <[email protected]>
>> wrote:
>>
>>> +users
>>>
>>> All we give is feedback.  The submission to GSoC is what matters.  So if
>>> you mentioned Perl here, that's not going to carry over to the reviewers.
>>>
>>> Can someone with fresh eyes take a look at this?  I read it too recently
>>> so I will gloss over it too much.
>>>
>>> Here are some posts the mentors list thought might be helpful.  The
>>> first I believe covers someone's pov who did not get selected.
>>>
>>> https://medium.freecodecamp.org/hacking-gsoc-how-to-gain-real-life-experience-and-support-open-source-b1e6a664f6e4?source=linkShare-53ba2bb84284-1521381334
>>>
>>> https://sanatt.me/2017/12/30/cracking-google-summer-code-2018/
>>>
>>> Regards, KAM
>>>
>>> On Tue, Mar 20, 2018, 03:57 Saahil Sirowa <[email protected]>
>>> wrote:
>>>
>>>> Hi Kevin and Apache SpamAssassin Dev Community,
>>>>
>>>> I have addressed all the changes you suggested in the previous draft.
>>>> 1) I mentioned learning Perl a week before the community bonding
>>>> period. It will not take much time. I can assure you that language is not
>>>> going to be an issue.
>>>> 2) I updated the biography part a bit.
>>>> 3) Significant changes have been made in the Timeline.
>>>> 4) I'm planning to use cmake/Travis CI for automated testing. If there
>>>> is a better alternative, please suggest it.
>>>> 5) I gave links to the research papers that I will be reading in the
>>>> timeline.
>>>> 6) I updated the timeline to mention gaining advanced knowledge of
>>>> email traffic and spam. I listed some links for that purpose.
>>>> 7) I updated the credits.
>>>> 8) There are other changes made in various parts of the proposal.
>>>>
>>>> Thanks for your previous detailed feedback.
>>>>
>>>> Here is the link to the updated proposal:
>>>> GSoC 2018 proposal
>>>> <https://docs.google.com/document/d/1-OCNv79sHvVViKwnrRYtlMiKWLCzz4xUW4tNOlmaTmw/edit#heading=h.q7h3lddabdvh>
>>>> Please rigorously review it and suggest any changes that I should make.
>>>>
>>>> Awaiting a favorable response.
>>>>
>>>>
>>>> Thanks...
>>>> Saahil Sirowa
>>>> B. Tech Computer Science and Engineering
>>>> Indian Institute of Technology, Hyderabad
>>>>
>>>> On Mon, Mar 19, 2018 at 3:27 AM, Kevin A. McGrail <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Saahil
>>>>>
>>>>> re: Perl. As the project is primarily in Perl and you do not list it,
>>>>> or any similar language like PHP, in your Proficiencies, I would address
>>>>> that.  The word Perl does not appear a single time.
>>>>>
>>>>> Your Biography is a little light on why this is something you feel you
>>>>> can implement.  The mentors will likely NOT be able to help you with the
>>>>> science, focusing instead on the community, processes, and open source
>>>>> in general.
>>>>>
>>>>> re: Email and Spam, do you have any experience with email traffic or
>>>>> spam?  If so, add it.  If not, explain what you plan to do to address
>>>>> that.
>>>>>
>>>>> Re: Deliverables, I think you'll need to propose the first draft of
>>>>> that.  But your goal will likely be a plugin for Apache SpamAssassin that
>>>>> can be installed and configured to provide multiple configurable
>>>>> statistical analysis algorithms to better identify ham (good email) and/or
>>>>> spam (bad email).
>>>>>
>>>>> Please use Apache SpamAssassin to properly brand the title.
>>>>>
>>>>> Re: I have no input on the scheduling/timelines except that past
>>>>> proposals I have read have included more phases and did not add "optional"
>>>>> items.  I'd prefer to see small increments to make sure you stay on
>>>>> schedule and don't get overwhelmed and find yourself way behind as
>>>>> time progresses.
>>>>>
>>>>> Re: Testing Methodology, this is likely the most critical missing
>>>>> part.  I am a fan of test-driven development, where you set up tests that
>>>>> should pass and fail and use continuous testing as you add code to confirm
>>>>> your development is progressing well.
>>>>>
>>>>> This is especially important because spam analysis often doesn't work
>>>>> the way people expect and tests w/statistics can help identify issues.
>>>>>
>>>>> For example, it is a hypothesis that these statistical algorithms
>>>>> will be better than Bayes.  So you'll need a baseline for comparison.
>>>>>
>>>>> Additionally, even experts in the field are surprised when they think
>>>>> something will prove the hamminess of an email but in fact shows the
>>>>> opposite.  Real-world example: SPF is a policy that, when introduced, was
>>>>> supposed to allow an automated mechanism that says "this is an email
>>>>> from a legitimate mail server for my domain".
>>>>>
>>>>> However, the FIRST wave of people to adopt it were all spammers.  So
>>>>> it became a spam indicator more than a ham indicator.  It was a very
>>>>> interesting outcome.
>>>>>
>>>>> Re: Corpora, you'll want corpora of carefully hand-sorted ham and
>>>>> spam.  Have you thought about how you'll get that?  I *might* be able to
>>>>> help but it's 50/50.
>>>>>
>>>>> Re: You mention reading research papers on statistical algorithms from
>>>>> a previous proposal.  You'll want to list them to show which ones you plan
>>>>> to study.
>>>>>
>>>>> re: "Discussions with the SA community regarding the various types of
>>>>> spams that the present SA can handle." is unclear.  What is a "type of
>>>>> spam" to you?  Do you have a list of types of spam?
>>>>>
>>>>> re: "Brainstorming with the mentors and SA community about the various
>>>>> input features and parameters that can have a huge impact on the overall
>>>>> performance of the listed neural nets models." I think this is flawed.
>>>>> There won't be a ton of people who can discuss this with you.  You'll
>>>>> likely need to use the scientific process to show what has a performance impact.
>>>>> This is not busy work or school work.  This is an experiment that has not
>>>>> been tried at the SA project.
>>>>>
>>>>> re: "actively involved with the community." is a stretch.  A few
>>>>> emails do not active involvement make.
>>>>>
>>>>> re: Bonding, you might consider raising that to 1-2 major bugs and
>>>>> 10-20 minor bugs.
>>>>>
>>>>> Re: Credits/references, I would add more clarity about where each of
>>>>> those references is used.
>>>>>
>>>>> Regards,
>>>>> KAM
>>>>>
>>>>
>>>>
>>
>
