On 8/31/2014 7:55 AM, Axb wrote:
On 08/31/2014 04:08 PM, Ted Mittelstaedt wrote:
Out of the box the default decision point of 5 is too high anyway.

SA is the framework - you can tune to your need as much as you want.

I think the emphasis on avoiding false positives in the stock
(non-Bayes) distribution is far too high. I suspect that over
the years many good rule submissions have been ignored because
incidence of false positives with them was too high for the
SA maintainers.

During the last 4 or so years, scores have been set by the masscheck GA
system. If more people contributed masschecks and rules, detection could
be better, but the lack of volunteers doing this shows that apparently
what SA does is good enough or there is little interest in commitment.


masscheck runs against your spam and ham.  But masscheck does not know
if what you're feeding it is actually ham or spam until you have gone
through your corpora and sorted it - moved the spam to the spam folder
and the ham to the ham folder (assuming, that is, that you get any
false positives).  That is why you say you want the corpora cleaned and
hand classified.

This is something that I only do every once in a while when I'm
preparing corpora for my Bayes database. If I set up masscheck to
look at my inbox and my junk mail folder on a nightly basis, there
is no guarantee that I got to my mail that day, or even that week,
to make sure that only ham is in my inbox and only spam is in
my junk mail folder.

If I have a folder full of spam that my local install of SpamAssassin has already marked as spam, then how does telling the SA project
"yep, ya got that right" change anything in the rules scoring?

There is a lack of explanation on the masscheck page as to how and
why it's useful.  And it is also clear that accidentally leaving spam
(spam that has not been identified as spam by SA) in your ham folder,
and false positives (ham) in your spam folder, is not going to help
masscheck any - if anything it's going to make the SA scoring worse.
That seems to me to be very important.
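
To illustrate the point about unsorted corpora, here is a minimal
sketch in Python. It is not how mass-check or the SA score generator
actually works; the function name, the corpus, and all the numbers are
made up for illustration. It only shows how a rule's apparent hit rates
shift when some spam is left sitting in the ham folder.

```python
def rule_stats(messages):
    """Return (spam_hit_rate, ham_hit_rate) for one hypothetical rule.

    messages: list of (label, rule_hit) pairs, where label is "spam"
    or "ham" according to the folder the message was found in.
    """
    spam_hits = [hit for label, hit in messages if label == "spam"]
    ham_hits = [hit for label, hit in messages if label == "ham"]
    return sum(spam_hits) / len(spam_hits), sum(ham_hits) / len(ham_hits)


# Hand-sorted corpus: the rule hits 80 of 100 spam and 0 of 100 ham.
clean = ([("spam", True)] * 80 + [("spam", False)] * 20
         + [("ham", False)] * 100)

# Same mail, but 10 of the spam messages the rule hits were never moved
# out of the inbox, so they get reported as ham.
dirty = ([("spam", True)] * 70 + [("spam", False)] * 20
         + [("ham", True)] * 10 + [("ham", False)] * 100)

clean_stats = rule_stats(clean)  # (0.8, 0.0)
dirty_stats = rule_stats(dirty)  # roughly (0.78, 0.09)
```

With the unsorted corpus, a rule that never hit real ham now appears to
hit about 9% of it, which is the kind of distortion that would push
anyone scoring from that data toward a lower score for a good rule.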

Perhaps that is why so few participate?  They do not understand why
masscheck is important to the SA project because the documentation on
it does not explain why.

For the same reason, SARE went belly up after volunteers drifted to new
interests, jobs, had families, etc.

The lack of general commitment and a general passive attitude expecting
"others" to do the job doesn't help at all.


That is a blame game that a lot of people on OSS projects take.

Most "others" out there using OSS packages do not have the skills to
contribute development time, even to contribute rules that do not
have unintended consequences.  You might think it simple to write
a rule, but it's not the writing that is the problem; it is the
thinking about the consequences.

I've seen some real showstoppers in SpamAssassin rules, such as the time
that someone wrote a rule to target certain spam that ended up
triggering on Outlook Express - and when confronted with this, the
author's response was along the lines of "well, OE does not produce
an RFC compliant header, so it's not MY problem."  Well sure, he was
right that OE does not produce an RFC compliant header - it's a piece
of crap.  Unfortunately at the time the rule was inserted it was a
piece of crap used by 1/2 of the Internet.  In other words he was not
willing to own the fact that this clever thing he had discovered
and turned into a rule was unusable because 1/2 of the users on the
Internet are morons using ancient crap mail client software.

I'd rather have fewer better rules from the SA developers who seem
to have the understanding of unintended consequences than more rules
from hotshots that figure they can go on a crusade to make everyone
on the Internet use the latest version of Thunderbird.  I just think
the SA developers are being just a bit too conservative on this.

Anyway, IMHO the people complaining about others not kicking back to OSS
projects really need to start by taking this beef up
with the people bundling SA in commercial products (like Untangle
firewall - which uses SA in its "free" version of Untangle, which
acts as marketing slippery slope fodder to get people into the
commercial product) because those people are developers already,
and making significant coin off the OSS project. It seems to me that
those people have a far stronger moral obligation to contribute
development time to the SA project than some admin out there of a
company mailserver who barely knows what the term regexp means.

Disclaimer:  For all I know Untangle developers do kick coding time
back to the SA project - I'm using them as a convenient example to illustrate the issue.

For a newbie to SA it is disheartening to install SA and not
get 90% with a 2% false positive, out of the box, but rather get
50% with a 0% false positive. And I think the mistake the
maintainers are making is over-reliance on Bayes.

Maintainers do what they can, on a voluntary basis. If newbies expect SA
to be FUSP out of the box, then they didn't get enough info beforehand.


Newbies expect any software product they install, be it commercial or
OSS, to be fully configured out of the box. That's the definition of a
newbie. I'm not excusing it, merely explaining reality.

At the least the SA maintainers should maintain a separate
"highly aggressive" rule distro that was optional that would
give us a much higher success rate with a corresponding
slight increase in false positives.

"should" ? SA devs are volunteers, contributing time and resources with
little return other than some personal satisfaction of helping others.
SA's development is not funded or backed by some multimillion corp.

What are you doing to contribute ?

SA is the framework - if you wish to start a sa-update channel for extra
aggressive rules_du_jour you're welcome to do it and if you find some
volunteers to help you, even better.


For starters as a SA user I do not feel the project is served by
multiple sa-update channels promulgating different rulesets. If I
had the coding ability to create a huge body of rules on par with
the existing SA rules, I would absolutely not set it up as a competing
ruleset.

I feel that whenever someone develops a rule that succeeds in catching
some spam and not damaging ham, that it should go into the main SA ruleset. That will get the widest distribution as quickly as possible.

But you misunderstand what I am saying.  I do not find fault with
how SA operates internally; my beef is with its out-of-box
configuration not being aggressive enough. Since I know how to modify
its configuration to get it more aggressive I do not have a problem
with it.  But I believe that newbies who do not engage Bayes are not
getting enough filtering from it.

Their design approach has been to rely on Bayes to be trained to go from
50% capture out of box with 0% FP to 80-90% capture with 0% FP.

an assumption, based on what?


Observation.  Fine, if you don't want to believe me, go ahead, but I have
spent a lot of time observing it on my servers.

But, the design approach could easily be relying on Bayes to go
from 90% capture with 5% FP out of the box, to 90% capture with
0% FP with Bayes, and the emphasis being on training Bayes on ham,
not spam.

Note I am pulling the percentages out of my ass, but I think you
get the idea.
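
The tradeoff being argued over can be sketched in a few lines of
Python. This is only an illustration: the scores below are invented,
not measured on any server, and the helper name is hypothetical. The
one detail taken from the thread is the default decision point of 5; a
message is treated as spam when its score reaches the threshold.

```python
def rates(scored, threshold):
    """Return (capture_rate, false_positive_rate) at a given threshold.

    scored: list of (score, is_spam) pairs. A message is flagged as
    spam when score >= threshold.
    """
    spam = [s for s, is_spam in scored if is_spam]
    ham = [s for s, is_spam in scored if not is_spam]
    capture = sum(s >= threshold for s in spam) / len(spam)
    false_pos = sum(s >= threshold for s in ham) / len(ham)
    return capture, false_pos


# Made-up scores for four spam and four ham messages.
scored = [(7.2, True), (5.5, True), (4.1, True), (3.3, True),
          (3.6, False), (1.0, False), (0.2, False), (-1.5, False)]

at_5 = rates(scored, 5.0)  # (0.5, 0.0): soft, no false positives
at_3 = rates(scored, 3.0)  # (1.0, 0.25): aggressive, some false positives
```

Lowering the threshold raises capture and false positives together;
which end of that curve the stock configuration should sit on is the
disagreement in this thread.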

By design, SA's Bayes is not FUSP, it's a small part of the arsenal -
depending on your skill to write rules, make use of other SA features,
etc, you can even run a very efficient filtering system without it.

There are simple methods to automagically feed Bayes with lots of spam
or ham - depending on what you feel you need most. It's up to you to be
creative and make use of SA's ton of features (including third party
rules/plugins)


I do not understand why you think I don't agree with this statement.

I am merely making an observation that the maintainers are approaching
SA with the idea of its out-of-box configuration being very soft, and
letting a lot of spam through until the admin starts tweaking knobs
and flipping switches.

There is nothing you have said that invalidates the approach of SA
having an out of box config that is very hard and lets very little spam
through until the admin starts tweaking knobs and flipping switches. It is simply the opposite approach.

And I am saying that I think it would be a better approach.  You are
not addressing that.

Ted
