https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7016
Adam Katz <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #1 from Adam Katz <[email protected]> ---
On 02/18/2014 2:19 AM EST, Axb wrote:
> there's several reasons why I consider these rules as dangerous and
> should not be included in the default SA ruleset.
>
> 1.- rules are created outside the the project's infrastructure and
> SA devs have no way to quickly control/modify output in case
> something goes bad.

Agreed. To safely fold these in, they'd need to be unscored (or scored 0.001) and used as predicates that are outside of the automated system. This would allow anybody with svn access to modify or even disable the rules.

> 2.- Spamcop is riddled with FPs and creating static rules based on
> it's output is either adding dangerous overlap or next to pointless
> due to low score. I'm positive this could initiate a whole separate
> discussion outside this thread.

Disclaimer: I work for SpamCop (though this rule generator hasn't had any major code edits since before I started working there and only uses publicly available data).

SpamCop is far better than it once was. In the last network mass-check run, it was pretty decent. Also note that we're invoking it differently than any other blocklist (others use last-external, RCVD_IN_BL_SPAMCOP_NET does not), so both its ham and spam counts are inflated for that.

More importantly, excluding KHOP_SC_TOP200, sc-neighbors is not the SpamCop blocklist (SCBL). SCBL is a direct IP blocklist while sc-neighbors is an abstraction that seeks spammy CIDRs. For this reason, sc-neighbors should never be assigned as many points as a DNSBL (unless DNSBLs are disabled and you're trying to compensate for that). Also, while sc-neighbors is restricted to the last-external relay for its CIDR8 rules, its other rules look at all untrusted headers (similar to the current RCVD_IN_BL_SPAMCOP_NET implementation).
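To illustrate the distinction drawn above between a direct IP blocklist and an abstraction that looks for spammy CIDRs, here is a minimal sketch. It is not the sc-neighbors generator: the function names, the /24 grouping, and the `min_hits` threshold are all assumptions chosen for illustration, and the addresses are documentation-range examples rather than blocklist data.

```python
import ipaddress
from collections import Counter


def spammy_networks(listed_ips, prefix_len=24, min_hits=3):
    """Group listed IPs by enclosing network and keep the crowded ones.

    prefix_len and min_hits are illustrative assumptions, not values
    taken from any real rule generator.
    """
    counts = Counter(
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in listed_ips
    )
    return {net: hits for net, hits in counts.items() if hits >= min_hits}


def in_spammy_neighborhood(ip, networks):
    """True if ip falls inside any flagged network, listed itself or not."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Documentation-range addresses (RFC 5737), standing in for a blocklist.
listed = ["192.0.2.10", "192.0.2.44", "192.0.2.200", "198.51.100.7"]
networks = spammy_networks(listed)
```

In this sketch, `192.0.2.99` is not itself listed but still matches because three of its /24 neighbors are. That is a weaker signal than a direct listing, which is the reason given above for scoring such rules well below a DNSBL.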
The only rule that is scored 0.5 or higher (in network-enabled tests) derives its data from PSBL, not SpamCop, and PSBL has a far lower ham hit rate (0.0069% to SCBL's 0.4783%, though note the implementation difference). We would indeed need a separate discussion if we were to consider rescoring these rules (by hand or by GA). I wanted only to highlight that they're scored very low in my channel. The neighbor rules are there to boost scores, not to block.

(KHOP_SC_TOP200 is a special case, implemented only to satisfy people specifically requesting it or even suggesting use of some stale syndication of a similar list (remember SARE_SPAMCOP_TOP200?) merely because mine didn't have it. This rule is disabled in my channel when network tests are enabled.)

> 4.- If your autocreating routine goes MIA, ppl are left with stale
> data - SA project has no control over that data.

Yes, that is why I have asked for the ability to automatically expire rules. SA conditionals currently function only at load time, so we don't have the ability to do that. Theoretically, we need only update the rules and propagate them. These rules should never ship for use in systems that do not auto-update (this is more true than for the rest of our rules, though those should also never be used without the prospect of regular updates).

I'd also be happy to check the code into the main SA svn repo (releasing it as Apache License), but only if we can reliably fold it into regular rule updates so people don't need to run it on their own servers to use its output. (Releasing this may take some time, I have lots of stupidity in there for toy projects that would need to be cleaned out. ... and I have very little spare time.)
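On the stale-data concern in point 4: since conditionals are evaluated only at load time, one option would be to check freshness outside the rule file, for example in whatever step installs an update. The sketch below assumes a hypothetical `# generated:` header comment with a UTC timestamp and a two-day limit. Neither is an existing SpamAssassin or sa-update feature.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical header line, e.g. "# generated: 2014-02-18T07:00:00Z".
STAMP = re.compile(r"^#\s*generated:\s*(\S+)", re.MULTILINE)


def is_stale(rule_text, now, max_age=timedelta(days=2)):
    """Return True if the rule text is unstamped or older than max_age."""
    match = STAMP.search(rule_text)
    if match is None:
        return True  # no timestamp: assume the worst
    generated = datetime.strptime(
        match.group(1), "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - generated > max_age


sample = "# generated: 2014-02-18T07:00:00Z\n# (rules would follow here)\n"
fresh_check = is_stale(sample, datetime(2014, 2, 19, tzinfo=timezone.utc))
stale_check = is_stale(sample, datetime(2014, 3, 1, tzinfo=timezone.utc))
```

An installer could use a check like this to skip or disable a generated file once its source has stopped updating, rather than continuing to load stale data.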
> 5.- masscheck results are a very small snapshot of global traffic
> and static IP/CIDR lists should be avoided - stuff changes too fast
> and a delayed daily update of a rule file is fine in a separate
> sa-update channel (yours, in this case) but should not be part of
> the SA framework.

This is true. My channel updated every four hours (back when it was online), but it only checks into svn daily (since mass-check only runs that often).

I also regularly run the channel against another corpus. I can't give too many details on it, but it's much bigger and much more frequent. The results contain a lower ham hit rate (and a higher spam hit rate). That doesn't mean it's fully "safe," but it is another indicator. (Just in case you think there might be some sampling bias given the sources shared with SpamCop, consider this: the hits on KHOP_SC_TOP200, which is 100% overlapped by SCBL, are half as much in my corpus.)

> A good example of preoblematic auto generated problems are the
> SOUGHT rules, one of them being empty for many months and as things
> are now we have no immediate way to fix whatever is required so
> it's good for them to be in an optional channel outside the default
> SA scope. Admins have a choice to drop a third party channel and
> the SA dev group cannot be made responsible for any issues outside
> their control.

Agreed, thus the proposal to make it inside SA dev control.

> Last but not least, SA should deliver a basic ruleset which should
> work globally, as static as possible and auto generated stuff
> should not affect the framework's results. Even autopromoted stuff
> has it's caveats and there are big plans to work on this to make SA
> leaner and avoid surprises.

The sandbox promotion system is essential since none of us are full time. Manual promotion would mean we're never prompt enough in adding or (more importantly) removing items from scoring.
We also suffer from the English-centric nature of the majority of our rules, a far cry from "global" with no easy solution.

What "framework" are you referring to here? SpamAssassin is more than just an engine, it's also a collection of content signatures. If we're to rely more on the engine than on updates to the signatures, we've got to do a lot more work on the tokenizer and its Bayesian evaluator. I wouldn't call the DNSBLs + URI DNSBLs + Bayes "enough," even with improvements to the latter.

> Personally, I don't consider it fair and questionable that you make
> use of the volunteered masschecker resources to do QA for your
> personal channel for years yet don't run a masschecker yourself.

I'm working on participating in mass-check, but it is low priority. I'll renew the dialog.

None of these evaluations are expensive. No online lookup is involved, they are run through a regex optimizer, and the data they scan is very small (it's not like the body!). Consider this like a DNSBL; it just happens to take the form of an sa-update channel.

These rules are in our mass-check to ensure that the channel has high efficacy and so anybody interested in using the channel can look at the public freqs data to verify that fact. This is also why SOUGHT was checked in.

> For these reason I ask you to remove the 20_khop_sc_bug_6114.cf
> file from the sandbox

If others feel this strongly, I will remove it. As it stands right now, removal is Alex's +1 to my -1, though I'd like to have more dialog before bringing it to a vote.

If I do remove it, I may rework the whole thing into another DNSBL. I've considered this in the past (it's actually implemented, though I haven't verified it in years, and it's down right now), it just seems like a waste given how few entries are there. (Plus it's nice to have this information accessible to those who can't use DNSBLs for whatever reason, be it configuration issues, network constraints, or worries of information leakage.)
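As a rough illustration of why matching a short relay address against a list of network prefixes is cheap, the sketch below folds several prefixes into one compiled pattern so that each address is tested in a single pass. It is not the regex optimizer mentioned above, and the helper name and prefix list are hypothetical.

```python
import re


def prefixes_to_regex(prefixes):
    """Combine dotted address prefixes into one anchored alternation.

    Each prefix keeps its trailing dot (for example "192.0.2.") so a
    match always ends on an octet boundary.
    """
    alternation = "|".join(re.escape(p) for p in sorted(prefixes))
    return re.compile(rf"^(?:{alternation})")


# Documentation-range prefixes (RFC 5737), not real rule data.
pattern = prefixes_to_regex(["192.0.2.", "198.51.100.", "203.0.113."])
hit = bool(pattern.match("192.0.2.55"))
miss = bool(pattern.match("192.0.20.5"))
```

The trailing dot matters: without it, the prefix `192.0.2` would also match `192.0.20.5`, which lies in a different network.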
I believe I've also responded to Kevin's points in the above text.

-- 
You are receiving this mail because:
You are the assignee for the bug.
