Re: GSOC 2018 SpamAssassin Statistical Classifier Plugin

Saahil Sirowa Tue, 20 Mar 2018 00:58:30 -0700

Hi Kevin and Apache SpamAssassin Dev Community,

I have resolved all the changes you suggested in the previous draft.
1) I mentioned that I will learn Perl a week before the community bonding
period. It will not take much time. I can assure you that the language is not
going to be an issue.
2) I updated the biography part a bit
3) Significant changes have been made in the Timeline.
4) I'm planning to use CMake/Travis CI for automated testing. If there is
a better alternative, please do suggest it.
5) I gave links to the research papers that I will be reading in the timeline.
6) I updated the timeline to include learning more about email traffic and
spam. I listed some links for the purpose.
7) I updated the credits
8) There are other changes in various parts of the proposal.


Thanks for your previous detailed feedback.

Here is the link to the updated proposal:
GSoC 2018 proposal
<https://docs.google.com/document/d/1-OCNv79sHvVViKwnrRYtlMiKWLCzz4xUW4tNOlmaTmw/edit#heading=h.q7h3lddabdvh>
Please rigorously review it and suggest any changes that I should make.

Awaiting a favorable response.


Thanks...
Saahil Sirowa
B. Tech Computer Science and Engineering
Indian Institute of Technology, Hyderabad

On Mon, Mar 19, 2018 at 3:27 AM, Kevin A. McGrail <[email protected]>
wrote:

> Hi Saahil
>
> re: Perl. As the project is primarily in Perl and you do not list that in
> your Proficiencies or any similar languages like PHP, I would address
> that.  The word Perl does not appear a single time.
>
> Your Biography is a little light on why this is something you feel you can
> implement.  The mentors will likely NOT be able to help you with the
> science, focusing instead on the community, processes, and open source in
> general.
>
> re: Email and Spam, do you have any experience with email traffic or
> spam?  If so, add it.  If not, explain what you plan to do to address that.
>
> Re: Deliverables, I think you'll need to propose the first draft of that.
> But your goal will likely be a plugin for Apache SpamAssassin that can be
> installed and configured to provide multiple configurable statistical
> analysis algorithms to better identify ham (good email) and/or spam (bad
> email)
>
> Please use Apache SpamAssassin to properly brand the title.
>
> Re: I have no input on the scheduling/timelines except that past proposals
> I have read have included more phases and do not add "optional" items.  I'd
> prefer to see small increments to make sure you stay on schedule and don't
> get overwhelmed and find yourself way behind as the time progresses.
>
> Re: Testing Methodology, this is likely the most critical missing part.  I
> am a fan of test-driven development where you set up tests that should pass
> and fail and use continuous testing as you add code to confirm your
> development is progressing well.
>
> This is especially important because spam analysis often doesn't work the
> way people expect and tests w/statistics can help identify issues.
>
> For example, it is a hypothesis that these statistical algorithms will be
> better than Bayes.  So you'll need a baseline for comparison.
>
> Additionally, even experts in the field are surprised when they think
> something will prove the hamminess of an email but in fact shows the
> opposite.  Real-world example: SPF is a policy that, when introduced, was supposed
> to allow an automated mechanism that says "this is an email from a
> legitimate mail server for my domain".
>
> However, the FIRST wave of people to adopt it were all spammers.  So it
> became a spam indicator more than a ham indicator.  It was a very
> interesting outcome.
>
> Re: Corpora, you'll want a corpus of carefully hand-sorted ham and spam.
> Have you thought about how you'll get that?  I *might* be able to help but
> it's 50/50.
>
> Re: You mention reading research papers on statistical algorithms from a
> previous proposal.  You'll want to list them to show which ones you plan to
> study
>
> re: "Discussions with the SA community regarding the various types of
> spams that the present SA can handle." is unclear.  What is a "type of
> spam" to you?  Do you have a list of types of spam?
>
> re: "Brainstorming with the mentors and SA community about the various
> input features and parameters that can have a huge impact on the overall
> performance of the listed neural nets models." I think this is flawed.
> There won't be a ton of people who can discuss this with you.  You'll need
> to likely use scientific process to show what has a performance impact.
> This is not busy work or school work.  This is an experiment that has not
> been tried at the SA project.
>
> re: "actively involved with the community." is a stretch.  A few emails do
> not active involvement make.
>
> re: Bonding, you might consider raising that to 1-2 major bugs and 10-20
> minor bugs.
>
> Re: Credits/references, I would add more clarity about where each of those
> references are used.
>
> Regards,
> KAM
>
