[Bug 7016] third party auto generated rules in sandboxes

bugzilla-daemon Mon, 24 Feb 2014 17:37:25 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7016

Adam Katz <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #1 from Adam Katz <[email protected]> ---
On 02/18/2014 2:19 AM EST, Axb wrote:
> there's several reasons why I consider these rules as dangerous and
> should not be included in the default SA ruleset.
>
> 1.- rules are created outside the project's infrastructure and
> SA devs have no way to quickly control/modify output in case
> something goes bad. 

Agreed.  To safely fold these in, they'd need to be unscored (or scored 0.001)
and used as predicates that are outside of the automated system.  This would
allow anybody with svn access to modify or even disable the rules.

> 2.- Spamcop is riddled with FPs and creating static rules based on
> its output is either adding dangerous overlap or next to pointless
> due to low score. I'm positive this could initiate a whole separate
> discussion outside this thread. 

Disclaimer:  I work for SpamCop (though this rule generator hasn't had any
major code edits since before I started working there and only uses publicly
available data).

SpamCop is far better than it once was.  In the last network mass-check run, it
was pretty decent.  Also note that we're invoking it differently than any other
blocklist (others use last-external, RCVD_IN_BL_SPAMCOP_NET does not), so both
its ham and spam counts are inflated for that.

More importantly, excluding KHOP_SC_TOP200, sc-neighbors is not the SpamCop
blocklist (SCBL).  SCBL is a direct IP blocklist while sc-neighbors is an
abstraction that seeks spammy CIDRs.  For this reason, sc-neighbors should
never be assigned as many points as a DNSBL (unless DNSBLs are disabled and
you're trying to compensate for that).  Also, while sc-neighbors is restricted
to the last-external relay for its CIDR8 rules, its other rules look at all
untrusted headers (similar to the current RCVD_IN_BL_SPAMCOP_NET
implementation).
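
To make the distinction between a direct IP blocklist and a neighborhood
abstraction concrete, here is a minimal sketch.  It is not the actual
sc-neighbors generator:  the function names, the /24 prefix length, and the
three-hit threshold are all assumptions chosen for illustration, and the
addresses come from the documentation ranges.

```python
import ipaddress
from collections import Counter

def spammy_neighborhoods(listed_ips, prefix_len=24, min_hits=3):
    """Return networks of prefix_len that contain at least min_hits listed IPs.

    prefix_len and min_hits are illustrative assumptions, not real settings.
    """
    counts = Counter(
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in listed_ips
    )
    return {net for net, hits in counts.items() if hits >= min_hits}

def relay_in_neighborhood(relay_ip, neighborhoods):
    """True if the relay address falls inside any flagged network."""
    addr = ipaddress.ip_address(relay_ip)
    return any(addr in net for net in neighborhoods)

# Hypothetical blocklist entries (documentation address space)
listed = ["192.0.2.10", "192.0.2.77", "192.0.2.200", "198.51.100.5"]
hoods = spammy_neighborhoods(listed)

# Not itself listed, but it sits among three listed neighbors
print(relay_in_neighborhood("192.0.2.33", hoods))
# Only one listed neighbor in its /24, so it is not flagged
print(relay_in_neighborhood("198.51.100.9", hoods))
```

A hit here says only that the relay lives in a bad neighborhood, not that the
relay itself was reported, which is why such a rule deserves fewer points than
a direct DNSBL hit.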

The only rule that is scored 0.5 or higher (in network-enabled tests) derives
its data from PSBL, not SpamCop, and PSBL has a far lower ham hit rate (0.0069%
to SCBL's 0.4783%, though note the implementation difference).
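
For a sense of scale, the two ham hit rates quoted above can be put side by
side with a few lines of arithmetic.  This only restates those two
percentages; because the lists are invoked differently, the ratio is a rough
indication rather than a like-for-like comparison.

```python
# Ham hit rates as quoted above, in percent
psbl_ham_pct = 0.0069
scbl_ham_pct = 0.4783

def per_100k(pct):
    """Convert a percentage into expected hits per 100,000 ham messages."""
    return pct / 100 * 100_000

ratio = scbl_ham_pct / psbl_ham_pct

print(f"PSBL: {per_100k(psbl_ham_pct):.1f} ham hits per 100,000 messages")
print(f"SCBL: {per_100k(scbl_ham_pct):.1f} ham hits per 100,000 messages")
print(f"SCBL hits ham roughly {ratio:.0f} times as often as PSBL")
```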

We would indeed need a separate discussion if we were to consider rescoring
these rules (by hand or by GA).  I wanted only to highlight that they're scored
very low in my channel.  The neighbor rules are there to boost scores, not to
block.

(KHOP_SC_TOP200 is a special case.  I implemented it only to satisfy people who
specifically requested it, or who suggested using some stale syndication of a
similar list (remember SARE_SPAMCOP_TOP200?) merely because my channel lacked
one.  This rule is disabled in my channel when network tests are enabled.)

> 4.- If your autocreating routine goes MIA, ppl are left with stale
> data - SA project has no control over that data. 

Yes, that is why I have asked for the ability to automatically expire rules. 
SA conditionals currently function only at load time, so we don't have the
ability to do that.  Theoretically, we need only update the rules and propagate
them.  These rules should never ship for use in systems that do not auto-update
(this is more true than for the rest of our rules, though those should also
never be used without the prospect of regular updates).

I'd also be happy to check the code into the main SA svn repo (releasing it as
Apache License), but only if we can reliably fold it into regular rule updates
so people don't need to run it on their own servers to use its output. 
(Releasing this may take some time:  I have lots of stupidity in there for toy
projects that would need to be cleaned out ... and I have very little spare
time.)

> 5.- masscheck results are a very small snapshot of global traffic
> and static IP/CIDR lists should be avoided - stuff changes too fast
> and a delayed daily update of a rule file is fine in a separate
> sa-update channel (yours, in this case) but should not be part of
> the SA framework. 

This is true.  My channel updated every four hours (back when it was online),
but it only checks into svn daily (since mass-check only runs that often).

I also regularly run the channel against another corpus.  I can't give too many
details on it, but it's much bigger and much more frequent.  The results
contain a lower ham hit rate (and a higher spam hit rate).  That doesn't mean
it's fully "safe," but it is another indicator.  (Just in case you think there
might be some sampling bias given the sources shared with SpamCop, consider
this:  the hits on KHOP_SC_TOP200, which is 100% overlapped by SCBL, are half
as frequent in my corpus.)

> A good example of problematic auto generated problems are the
> SOUGHT rules, one of them being empty for many months and as things
> are now we have no immediate way to fix whatever is required so
> it's good for them to be in an optional channel outside the default
> SA scope. Admins have a choice to drop a third party channel and
> the SA dev group cannot be made responsible for any issues outside
> their control.

Agreed, thus the proposal to make it inside SA dev control.

> Last but not least, SA should deliver a basic ruleset which should
> work globally, as static as possible and auto generated stuff
> should not affect the framework's results.  Even autopromoted stuff
> has its caveats and there are big plans to work on this to make SA
> leaner and avoid surprises.

The sandbox promotion system is essential since none of us are full time. 
Manual promotion would mean we're never prompt enough in adding or (more
importantly) removing items from scoring.  We also suffer from the
English-centric nature of the majority of our rules, a far cry from "global"
with no easy solution.

What "framework" are you referring to here?  SpamAssassin is more than just an
engine; it's also a collection of content signatures.  If we're to rely more on
the engine than on updates to the signatures, we've got to do a lot more work
on the tokenizer and its Bayesian evaluator.  I wouldn't call the DNSBLs + URI
DNSBLs + Bayes "enough," even with improvements to the latter.

> Personally, I don't consider it fair and questionable that you make
> use of the volunteered masschecker resources to do QA for your
> personal channel for years yet don't run a masschecker yourself. 

I'm working on participating in mass-check, but it is low priority.  I'll renew
the dialog.

None of these evaluations are expensive.  No online lookup is involved, they
are run through a regex optimizer, and the data they scan is very small (it's
not like the body!).  Consider this like a DNSBL; it just happens to take the
form of an sa-update channel.
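
As a rough illustration of why an offline check like this is cheap, the sketch
below folds a list of address prefixes into one compiled pattern and tests a
relay address against it.  It is a crude stand-in, not the regex optimizer
mentioned above (a real optimizer would also factor out shared prefixes);
compile_prefix_rule and the sample prefixes are hypothetical.

```python
import re

def compile_prefix_rule(prefixes):
    """Fold dotted address prefixes into a single anchored regex."""
    # Try longer (more specific) prefixes first
    ordered = sorted(set(prefixes), key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # The trailing literal dot stops "192.0.2" from matching "192.0.20.x"
    return re.compile(r"^(?:" + alternation + r")\.")

# Hypothetical prefixes from the documentation address ranges
rule = compile_prefix_rule(["192.0.2", "198.51.100", "203.0.113"])

print(bool(rule.match("192.0.2.15")))   # inside a listed prefix
print(bool(rule.match("192.0.20.15")))  # similar digits, different network
```

The whole check is one regex match against a short string, with no DNS query
involved.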

These rules are in our mass-check to ensure that the channel has high efficacy
and so anybody interested in using the channel can look at the public freqs
data to verify that fact.  This is also why SOUGHT was checked in.

> For these reasons I ask you to remove the 20_khop_sc_bug_6114.cf
> file from the sandbox 

If others feel this strongly, I will remove it.  As it stands right now,
removal is Alex's +1 to my -1, though I'd like to have more dialog before
bringing it to a vote.

If I do remove it, I may rework the whole thing into another DNSBL.  I've
considered this in the past (it's actually implemented, though I haven't
verified it in years, and it's down right now), but it seems like a waste
given how few entries are there.  (Plus it's nice to have this information
accessible to those who can't use DNSBLs for whatever reason, be it
configuration issues, network constraints, or worries of information leakage.)

I believe I've also responded to Kevin's points in the above text.

-- 
You are receiving this mail because:
You are the assignee for the bug.
