Re: GSOC 2018 SpamAssassin Statistical Classifier Plugin

Kevin A. McGrail Sun, 18 Mar 2018 14:58:26 -0700

Hi Saahil

re: Perl. The project is primarily written in Perl, yet you do not list it,
or any similar language like PHP, in your Proficiencies. I would address
that.  The word Perl does not appear a single time.


Your Biography is a little light on why this is something you feel you can
implement.  The mentors will likely NOT be able to help you with the
science, focusing instead on the community, processes, and open source in
general.

re: Email and Spam, do you have any experience with email traffic or spam?
If so, add it.  If not, explain what you plan to do to address that.

Re: Deliverables, I think you'll need to propose the first draft of that.
But your goal will likely be a plugin for Apache SpamAssassin that can be
installed and configured to provide multiple configurable statistical
analysis algorithms to better identify ham (good email) and/or spam (bad
email).

Please use the full name "Apache SpamAssassin" in the title to brand it
properly.

Re: Scheduling/timelines, I have no input except that past proposals I
have read included more phases and did not add "optional" items.  I'd
prefer to see small increments to make sure you stay on schedule and don't
get overwhelmed and find yourself way behind as time progresses.

Re: Testing Methodology, this is likely the most critical missing part.  I
am a fan of test-driven development, where you set up tests that should pass
and fail and use continuous testing as you add code to confirm your
development is progressing well.

This is especially important because spam analysis often doesn't work the
way people expect and tests w/statistics can help identify issues.

For example, it is a hypothesis that these statistical algorithms will be
better than Bayes.  So you'll need a baseline for comparison.
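
A minimal sketch of what such a baseline comparison could look like, in
Python purely for illustration (the plugin itself would be Perl). The
labels, the scores, the 5.0 threshold, and the function name error_rates
are all made-up assumptions for this example, not real SpamAssassin output:

```python
def error_rates(labels, scores, threshold):
    """Return (false_positive_rate, false_negative_rate).

    labels: "ham" or "spam" for each message in a hand-sorted corpus.
    scores: one classifier score per message; score >= threshold means spam.
    Assumes the corpus contains at least one ham and one spam message.
    """
    ham_total = sum(1 for label in labels if label == "ham")
    spam_total = sum(1 for label in labels if label == "spam")
    # False positive: ham that the classifier flagged as spam.
    false_pos = sum(1 for label, score in zip(labels, scores)
                    if label == "ham" and score >= threshold)
    # False negative: spam that the classifier let through.
    false_neg = sum(1 for label, score in zip(labels, scores)
                    if label == "spam" and score < threshold)
    return false_pos / ham_total, false_neg / spam_total


# Hypothetical toy corpus: four ham and four spam messages.
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
# Hypothetical scores from the existing baseline and from the new algorithm.
baseline_scores = [0.1, 5.2, 1.0, 2.0, 7.5, 4.0, 6.1, 3.9]
candidate_scores = [0.3, 2.1, 0.8, 1.5, 8.0, 5.5, 6.4, 4.2]

baseline = error_rates(labels, baseline_scores, 5.0)
candidate = error_rates(labels, candidate_scores, 5.0)
print("baseline  FP/FN:", baseline)
print("candidate FP/FN:", candidate)
```

The point is only that both classifiers are scored on the same corpus with
the same threshold, so any difference comes from the algorithm and not from
the test setup.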

Additionally, even experts in the field are surprised when they think
something will prove the hamminess of an email but in fact shows the
opposite.  Real-world example: SPF is a policy that, when introduced, was
supposed to provide an automated mechanism that says "this is an email from
a legitimate mail server for my domain".

However, the FIRST wave of people to adopt it were all spammers.  So it
became a spam indicator more than a ham indicator.  It was a very
interesting outcome.
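
A rough sketch of how a test could catch that kind of surprise: measure how
often an indicator fires in ham versus spam instead of assuming its
direction. This is Python for illustration only; the message records, the
"spf_pass" field, and the counts are invented toy data, not measurements:

```python
def hit_rates(messages, feature):
    """Return (ham_rate, spam_rate): the share of each class where feature is set.

    messages: dicts with a "label" key ("ham" or "spam") and boolean features.
    Assumes at least one ham and one spam message are present.
    """
    ham = [m for m in messages if m["label"] == "ham"]
    spam = [m for m in messages if m["label"] == "spam"]
    ham_rate = sum(1 for m in ham if m[feature]) / len(ham)
    spam_rate = sum(1 for m in spam if m[feature]) / len(spam)
    return ham_rate, spam_rate


# Invented toy corpus: the indicator fires on 1 of 4 ham and 3 of 4 spam.
messages = (
    [{"label": "ham", "spf_pass": True}] * 1
    + [{"label": "ham", "spf_pass": False}] * 3
    + [{"label": "spam", "spf_pass": True}] * 3
    + [{"label": "spam", "spf_pass": False}] * 1
)

ham_rate, spam_rate = hit_rates(messages, "spf_pass")
# If the indicator fires more often in spam, it leans spam, whatever it was
# designed to signal.
leans_spam = spam_rate > ham_rate
print(ham_rate, spam_rate, leans_spam)
```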

Re: Corpora, you'll want corpora of carefully hand-sorted ham and spam.
Have you thought about how you'll get that?  I *might* be able to help but
it's 50/50.

Re: You mention reading research papers on statistical algorithms from a
previous proposal.  You'll want to list them to show which ones you plan to
study.

re: "Discussions with the SA community regarding the various types of spams
that the present SA can handle." is unclear.  What is a "type of spam" to
you?  Do you have a list of types of spam?

re: "Brainstorming with the mentors and SA community about the various
input features and parameters that can have a huge impact on the overall
performance of the listed neural nets models." I think this is flawed.
There won't be a ton of people who can discuss this with you.  You'll
likely need to use the scientific process to show what has a performance impact.
This is not busy work or school work.  This is an experiment that has not
been tried at the SA project.

re: "actively involved with the community." is a stretch.  A few emails do
not active involvement make.

re: Bonding, you might consider raising that to 1-2 major bugs and 10-20
minor bugs.

Re: Credits/references, I would add more clarity about where each of those
references is used.

Regards,
KAM
