(I hope this reply makes it to the right place, I'm using Earthlink's webmail,
which is less than intuitive about where replies will go.)
We could potentially be more aggressive, but the problem becomes FP rates.
You can look for anything like \w+\.\w+, but then things like
Run command.com
I have to say, my favorite is definitely 4.3h, but with the smooth
arrow, not the cut and pasted one. :-)
3 - Define 1 (ONE) color scheme from the ones I created.
4.3h 4.3g 4.3f
Hi Daniel. I mucked with 4.3h a little bit to get it a little closer to
what I would like; see attached. If
- rename the current Mail::SpamAssassin::PerMsgStatus class to
Mail::SpamAssassin::Scan
Personally I'd prefer Mail::SpamAssassin::Assassinate. No, it doesn't make
any obvious sense, but a project isn't worthwhile if you can't have one
class somewhere with a fun name that everyone just
so I'm thinking that we should replace parts of this with arrays, using
integer indexes, instead of hashes with string indexes.
Array lookups are quite a bit faster than hash lookups.
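To illustrate the idea (sketched here in Python rather than Perl, with made-up field names): resolve each string key to an integer slot once, up front, then pay only an integer index on the hot path.

```python
# Assumption: the set of field names is known up front, so each can be
# assigned a fixed integer slot. Field names are invented for illustration.
FIELDS = ['score', 'hits', 'body_len']
IDX = {name: i for i, name in enumerate(FIELDS)}  # one-time string lookups

row = [0.0, 0, 0]          # array-backed record, still dynamically sized
row[IDX['score']] = 4.2    # hot path uses only the integer index
```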
I have no idea how painful linked lists are in Perl (or if they even exist).
But if you are essentially
I have no idea how painful linked lists are in Perl (or if they even
exist).
Why are you commenting then???
Because they are very useful, as I pointed out.
They don't exist as a native data structure. Arrays are fast, painless,
and dynamically sized.
They don't exist as a native data
Well, we were thinking --lint reports errors, --lint --debug or -D
--lint would report errors and warnings (since warnings would be
generated as debug-level messages from --lint).
But -D in general throws TONS of messages whether anything is broken or not,
making it virtually necessary to
Looks very interesting. The \1 is a performance-killer, though. This
really needs to be implemented as an eval rule... something like
sub check_whatever {
  my ($self) = @_;
  my $mid = $self->get('MESSAGEID');
  if ($mid =~ m/[A-Z]{28}\.(.+?)/) {
    my $from = $self->get('From');
however: 100 URLs is pretty low. it's worth noting these are the *first*
100 URLs found in the message, but still -- there may be a way a spammer
could overload this and get past SpamAssassin by loading up 100 URLs
before their payload URL.
thoughts?
Possibly you could prioritize urls
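One hypothetical way to prioritize, sketched in Python (the function and URL names are invented for illustration; this is not SA code): admit one URL per distinct host before admitting any repeats, so a spammer padding the message with copies of one decoy host can't push a later payload URL out of the 100-URL window.

```python
from urllib.parse import urlparse

def select_urls(urls, limit=100):
    # Hypothetical sketch: first-seen URL per host goes to the front of
    # the queue; repeats of already-seen hosts only fill leftover slots.
    seen_hosts, firsts, repeats = set(), [], []
    for u in urls:
        host = urlparse(u).netloc
        if host in seen_hosts:
            repeats.append(u)
        else:
            seen_hosts.add(host)
            firsts.append(u)
    return (firsts + repeats)[:limit]
```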
I find it interesting that the processor usage is (head tests, eval tests,
body tests) in that order.
I would normally expect headers to be no larger than the body in most cases.
This implies that either my assumption is wrong, or head tests are more
complex, or there are more head tests than body
An alternate simple case to detect local mail delivery would be to count the
received headers, whether they can be parsed or not, and are trusted or not.
If #received-hdrs = 1 and trusted==0 and untrusted==0, assume local
delivery and trust it. Probably should be a configurable option if done at
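The proposed heuristic is simple enough to state as code; a minimal sketch in Python (function and parameter names are hypothetical):

```python
def looks_like_local_delivery(received_count, trusted, untrusted):
    # The heuristic as proposed above: exactly one Received header,
    # and it was classified as neither trusted nor untrusted.
    return received_count == 1 and trusted == 0 and untrusted == 0
```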
I think I'm going to come down on the other side of this from Tony, and from
the wontfix closure on 3650 or whatever it was.
General philosophy: if it is easy to f*** it up in Exim, and easy to correctly
parse the f**'ed results, go ahead and do it. (Alternately, have a debug or
even a
Doesn't the free VC install include nmake? The normal one does.
The DDK also includes Nmake, and a considerably newer version than what
comes with the standard VC++ 6.0. Unfortunately the current DDK only comes
on CD these days, however it is still free, save postage and time.
Loren
I do not agree with this conclusion. As I already commented on another
bug ([Bug 3085] TRACKER_ID rule not very useful) some languages simply
use longer words/sentences (on average) than English.
Having no short and accurate translations of many/most computer related
English terms complicates the
Someone a few months back already implemented a way to integrate SA with at
least one of the blog tools, and I think reported that it helped a lot.
This was just using normal SA filtering, I believe along with a modified
rule base. I think this implementation was pre-3.0, or no later than the
Any thoughts on this?
For certain rules I think it would be a great idea.
The trick is not so much running the rule globally, as it is getting the hit
count to use in the score generation. I don't think (although I may be
wrong) that Perl can tell you how many times a regex hit in a global
I'd have to take this into account when optimising the scores. Then,
since the scores would be optimised for multiple hits, spammers would
only have to reduce the number of hits to evade SpamAssassin.
This strikes me as more of an implementation problem than an argument
against the concept.
If I don't take it into account when optimizing the scores, then the
increased scores will cause more false positive errors.
What you probably need to take into account is the cumulative score for the
rule in the test corpus. Which of course you do for all rules. The only
oddity you would
oh good, so you've changed your mind since
http://bugzilla.spamassassin.org/show_bug.cgi?id=3781#c3 then ;)
Somewhat. I still think it should be a plugin.
There's a problem with plugins I hadn't realized when they were originally
being advertised as the universal solution to oddball rules.
I was thinking along the lines of something that SpamAssassin downloads
once a month, or queries to find out if there is an update available and
only downloads if there is. Since the idea is to limit DNS queries, of
While it isn't part of the official SA project, this sounds like exactly a
job
they solved and I'm sick to death of them! :( Some sensible wrapping
code would be simpler, and save EVERYONE a lot of trouble.
I don't think wrapping code is a solution at all. Email is
fundamentally 80 columns. Names that go over about a quarter of that
length mean that the
next if ($answer->type ne 'A' && $answer->type ne 'TXT');
# skip any A record that isn't on 127/8
next if ($answer->type eq 'A' && $answer->rdatastr !~ /^127\./);
Shouldn't that prevent what Vance's comment #23 debug log output shows
even if
the wrong query's results were associated with a
I agree with Daniel that the new formatting isn't real nice; and in the
cited example I can't see any reason for the formatting change.
However I also strongly agree with Justin's reasoning and stated limits on
doing this.
I think I regard this as bugs in the new formatting code that can be
Hum, what's wrong with this encoding...
--=_NextPart_000_91FF8_43B69930.6DEB20A0
Content-Type: text/plain;
charset=iso3 8 2isyw34 8 8udg
How about a simple debug printout of the id value sent and the id value
received? Maybe it is as simple as the id matching code is failing.
Loren
How big is the masscheck run? Probably lots of messages and lots of
requests?
I'm betting at some point the client pipe number cycles. It is limited to a
range significantly smaller than 2^16, but I don't recall the exact range
limits.
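A rough birthday-bound calculation supports the cycling theory: with ids drawn from a space no larger than 2^16, a collision among the outstanding queries of a large mass-check run becomes near-certain. A sketch in Python:

```python
def collision_prob(n, space=2 ** 16):
    # Birthday bound: probability that at least two of n randomly chosen
    # ids from `space` possibilities coincide.
    p_unique = 1.0
    for i in range(n):
        p_unique *= (space - i) / space
    return 1.0 - p_unique
```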
How or why this should make you start seeing duplicates I
$self->{_initted} = $$;
srand;
$self->{resolver}->reinit_post_fork();
In the C/C++ world, srand has a parameter. I suspect the C srand() function
underlies Perl's srand statement. Depending on the value of the parameter
that Perl is deducing from that statement, you may or may
I'd certainly vote for that, if I had a vote!
Loren
How about making due diligence easier when you KNOW there is a probable
change required (i.e., the need to add a plugin line to maintain default
behaviour that didn't require such a line in the previous release(s)).
At the end of make install or some reasonable place, run a scan of init.pre,
or
I believe other delimiters are legal.
Indeed. I commonly write
m'stuff'i
if I'm going to be matching slashes.
BTW, I also will commonly write things like
=~ /BADWord/   # no /i
to make it clear that, no, I *did not* forget to make that test
case-insensitive.
Depending on what leads
I prefer uint for bitmasks. Other than that it seems fine. I like the idea
of going to tell or some such; 'collabreport' gives me the shudders.
Loren
How interesting. Six of em, and all in a row!
Loren
It sounds like whenever we check for version, we should do something like:
$version =~ s/_\d.*$//;
Why not s/_/\./ instead? Or if you want to treat it as a float, just drop
the underscore or replace it with a zero? Either of those would preserve
the fact that you have a fractionally higher
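The options under discussion, compared side by side and sketched in Python on a hypothetical Perl-style development version string:

```python
import re

v = '3.04_01'  # hypothetical Perl-style development version string

stripped = re.sub(r'_\d.*$', '', v)   # the proposed s/_\d.*$//
dotted   = v.replace('_', '.')        # treat the suffix as one more component
floated  = float(v.replace('_', ''))  # stays "fractionally higher" than 3.04
```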
Title: RFC: Normalized text ruletype
Wow, neat! I've been looking at something like this for
quite some time.
Adding in pipes and some of the other characters known to be used for
obfuscations could well drastically increase your hit ratios; they are
really common.
I think this is quite
Is this the right forum for this question?
The user's list would have been the more appropriate place.
body DRUG_ED_CAPS /\bCIALIS|LEVITRA|VIAGRA/
according to my resident regex expert will only look for the \b in front
of the CIALIS,
and not in front of LEVITRA or VIAGRA
Seems
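The resident regex expert is right: alternation binds more loosely than concatenation, so the \b attaches only to the first branch. The same precedence applies in Python's re module, which makes it easy to demonstrate:

```python
import re

# Alternation binds loosely, so \b applies only to the CIALIS branch:
loose = re.compile(r'\bCIALIS|LEVITRA|VIAGRA')
# A non-capturing group makes the word boundary apply to every branch:
strict = re.compile(r'\b(?:CIALIS|LEVITRA|VIAGRA)\b')

assert loose.search('tabLEVITRAx')        # hits inside a longer word
assert not strict.search('tabLEVITRAx')   # grouped version does not
assert strict.search('buy LEVITRA now')   # but still hits the real word
```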
I just remembered that I used strchr(3) in my last commit to spamc, and
according to the man page it is part of C99, so it might be missing on
some systems (?).
FWIW, I don't believe I've *ever* seen a C/C++ implementation for any real
system that didn't have strchr, all the way back to the
_SCORE(PAD)_: message score. If PAD is included and is either spaces or
zeroes, then pad scores with that many spaces or zeroes.
There may be another bug here. I tried a few days ago using a pad of ( 00)
and ended up with the space ignored. What I wanted
Currently SpamAssassin 3.10pre4 gets ALMOST all the
way through the tests.
Failed Test   Stat Wstat Total Fail  Failed  List of Failed
-----------------------------------------------------------
t/meta.t                      2    1  50.00%  1
Then you are clean, this is a
A big part (perhaps the biggest part) of rules development is the mass
check. Most anyone can develop a rule on their home system and see how they
*think* it works.
Some few (but not many) people can do a mass-check on their home system and
see how it *really* works - *for them*.
As proposed,
As rules are put into the sandboxes, they become part of svn. When the
nightly mass-checks are run, each person pulls the latest rules sandboxes
from svn and does their mass-check with all of those, then rsyncs the
results back up to the central site once the mass-check completes.
I think I
Duncan wrote:
I think the first point is the bigger one. Ultimately, Dan's sandbox
proposal may solve part of the not enough rules problem by making it
easier for people to contribute rules. But I'd like to hear from
potential rule submitters -- would this be a step in the right
direction? Is
I'd like to see if there's a way to combine the two somehow so that new
SVN commits that update sandbox rules, are immediately mass-checked alone.
However, I can't see a way to do that reliably from SVN commits alone,
because (for example) meta rules may depend on other rules that were not
What I miss most is a transparent dataset about every rule.
I'd like to know
- percentage of false positives
- percentage of flase negatives
- percentage of true positives
- percentage of true negatives
- number of mails checked for the results above
- standard deviation of the percentages
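Most of that wishlist can be derived from four raw counts per rule; a sketch in Python (function and field names invented for illustration):

```python
import math

def rule_stats(tp, fp, tn, fn):
    # Per-rule dataset as wished for above: the four percentages, the
    # number of mails checked, and a binomial standard error for each
    # percentage.
    n = tp + fp + tn + fn
    stats = {
        'n': n,
        'tp_pct': 100.0 * tp / n,
        'fp_pct': 100.0 * fp / n,
        'tn_pct': 100.0 * tn / n,
        'fn_pct': 100.0 * fn / n,
    }
    for key in ('tp', 'fp', 'tn', 'fn'):
        p = stats[key + '_pct'] / 100.0
        stats[key + '_sigma'] = 100.0 * math.sqrt(p * (1 - p) / n)
    return stats
```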
Sidney writes:
Perhaps we could use SVN to check in rule submissions so they are version
controlled and tracked, and have emails refer to file paths and version
numbers instead of attaching the rules. Would that be too complex for the
people we want to attract compared to mailing in sets of rules
Could the list be a semi-private one, with moderated subscription and
posting? That'd take care of rules in development being exposed
to spammers while they're still being worked on, at least partially.
The SARE list is private and invitation only for exactly these reasons.
You don't want to
Sidney writes:
Dealing with metarules and modifications to them presents a problem in any
case. How do we deal with person X submitting a modification to metarule A
and proposed rule A1, while person Y submits a different modification to
metarule A and proposed rule A2 while person Z submits
---
I guess that part of making the rule submission and test process nimble is
for the submitted rules to be independent of anything else. That makes
changing metarules less of a nimble process. That's fine, because metarules
are really just an optimization which can be implemented after the fact
Duncan earlier enscribed:
Masscheck has an interdependency option, although it increases the
checking time. We use it on rules once they seem useful, but not usually
in early one-off checking.
I'm not sure what you mean by this. We have an overlap script which
does some of this -- is that
I'm *really worried* about proposals that involve mailing lists that
have only private archives and require moderator approval for
subscription. It just doesn't feel right for an open source project.
I understand the feeling. I'm trying to balance the obvious desire for a
completely public
I guess you'd have better data than I would; but I'm still having
trouble believing that Spammers are adjusting on that time frame.
Some do; not all do. However, the ones that can adjust in less than a day,
or maybe less than 2-3 days sometimes, tend to be some of the more prolific
spammers.
May I help?
(How will you folks decide?)
Well, to paraphrase how we decide in SARE -- do something, we'll watch.
And it really is pretty much that simple.
I expect (and this is personal opinion, I'm not an SA dev) that the rules
subproject will sooner or later consist of anointed
I know user rules aren't real popular with the sa dev community, however
that attitude isn't universally shared by sa users. Therefore may I
suggest:
Would it be possible when reorganizing things to come up with some
semi-persistent storage for compiled user rules, so that they don't have to
be
it's not a matter of popularity -- it's a matter of being horrendously
difficult to support.
I grant from what I've seen of PMS that this gets pretty ugly. Or at least
it seems that way to me, but then a lot of apparently good Perl looks pretty ugly
to me. ;-) But I'm a C++ and Algol programmer,
That's why we use 70_sare_name_eng.cf files, to indicate that these
rules work well only on systems which expect almost 100% English ham,
and little to no ham in other languages.
I've begun to wonder whether it might be worth while having
50_scores.cf for English emails, and then
How would we determine ham/spam? At this point all we have is SA's
first estimation, and no way of knowing whether this is accurate, FN,
or FP.
All we could reasonably do is take SA's assessment of the message and assume that
statistically it will be correct to one or two sigma or so. If the
More thought ... what if SA systems were to accumulate daily
statistics, along the lines of one record for each rule, containing:
That sounds like the general sort of vague idea I had, fleshed out in more
detail.
Certainly the desirable goal is basically:
1 does this rule hit anything?
2 does
Example: I am currently writing a very FEW rules, some from
scratch and some by adapting the work or ideas of others from
such lists or web sites.
You have all convinced me that if I post a rule for discussion
that it is then close to worthless.
It depends on how you post it. And it may
a) what the heck are priorities, who sets them, and do they really have
any justifiable purpose? I.e., can they just quietly vanish into the
night with nobody being any the wiser?
They order the rules -- or more correctly, sets of rules.
Most rules are priority 500 (iirc), but some need
I was thinking about the 'best' way to shortcut running rules when they
weren't needed, and suddenly realized there might be cases where it is
necessary to run them even though they won't determine the hammyness or
spammyness of the mail.
In particular, I'm wondering about bayes and awl
It seems obvious that we want to run that -100 rule first. If it hits, the
maximum possible score if *every* other rule hits will be 4, and with a
threshold of 5, the mail can't be spam. So we can stop after the -100 rule
hits, and only run one rule on this mail.
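The stopping condition described above is just a comparison against the sum of the remaining positive scores; a minimal sketch in Python (names hypothetical):

```python
def can_short_circuit(score_so_far, remaining_positive, threshold=5.0):
    # If even every remaining positive-scoring rule hitting cannot reach
    # the spam threshold, the rest of the rules are wasted work.
    return score_so_far + remaining_positive < threshold
```

With the numbers from the example above, a -100 hit against 104 points of remaining positive rules tops out at 4, below the threshold of 5, so scanning can stop.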
This just brought up an
+score BAYES_50 0 0 0.845 0.001 # n=1
+score BAYES_60 0 0 2.312 0.372 # n=1
+score BAYES_80 0 0 2.775 2.087 # n=1
+score BAYES_95 0 0 3.023 2.063 # n=1
+score BAYES_99 0 0 2.960 1.886 # n=1
I think the score for BAYES_99 should be hand tweaked, regardless of what the
score generator said.
This
naming isn't really much of a big deal but it'd be nice to have some way
to keep track of that. (not that I can think of it.)
Look at some of the SARE rule files that Bob maintains. He has a formalized
set of comments that get stuck to rules, and one of these can/does show the
history
You need to ask this question on the users list. This list is to discuss
spamassassin development.
Are you SURE that was a valid message? If so, it will be the first recorded
instance of X-Message-Info showing up in ham and not only in spam.
Previously that had been a sure sign of a spam tool generated mail.
This is quite similar to two recent bugs that caused similar problems if
certain ascii characters immediately followed the URI. Spammers had
exploited at least one of those cases. I don't know what the fix was for
those bugs, but it may have been similar to the change you propose.
Loren
Could you please point this thread at the two bug numbers? I'd like to
target these for a future 3.0.5 bug-fix release, because we are very
unlikely to be able to upgrade our Enterprise distro to 3.1 in the short to
medium term. (I am hoping in the long term to have both RHEL4 and RHEL5
on
Agree in general, but possibly...
2. code-tied rules stay with main tree in current rules directory with
the exception of 25_replace.cf which is really just another way to
write body/header rules (basically, the static stuff that is tied to
code does not move to the rules project)
How big are they? SA is set up to bypass messages over a given size.
The following functions, immediately after they all call
Mail::SpamAssassin::Message::Node::decode, need to call a
function that does charset normalization.
* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
*
Just looking from the sidelines, it seems the obvious answer would be to add
a new namespace to the blacklist. eg:
*.2.1.9.ipv6.rbl.example.org.
instead of
*.2.1.9.rbl.example.org.
Since this is for numeric lookups, an alpha or alphanum tag in what would
be the high octet of the ipv4 dotted
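A sketch of such a split namespace in Python (the zone name is a placeholder, and this is not SA code): reversed octets for IPv4, reversed nibbles under a distinct ipv6 label for IPv6, so the two address families cannot collide.

```python
import ipaddress

def rbl_query_name(ip, zone='rbl.example.org'):
    # Build the DNSBL query name: reversed dotted octets for IPv4,
    # reversed hex nibbles under an extra 'ipv6' label for IPv6.
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        return '.'.join(str(addr).split('.')[::-1]) + '.' + zone
    nibbles = addr.exploded.replace(':', '')[::-1]
    return '.'.join(nibbles) + '.ipv6.' + zone
```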
Justin writes:
I think we don't even need to do that; once we get the search directories
recursively code worked out for configuration and rules, plugins will be
loadable from *any* directory in the rules project:
ROOT/rules/group/20_name_of_file.cf
I *think* what Daniel was thinking of here, which should work, is
just using the ifversion commands to conditionalize too-advanced
rules.
Assuming ifversion can be used in the negative also. For instance, we have
one set of meta rules that use addition post-whatever, and do a less-good
job
Better asked on the user's list, where there are people running systems like
that.
Loren
Note also
echo "score MICROSOFT_EXECUTABLE 4" >> .spamassassin/user_prefs
Isn't that a 2.6x rule that went away in 3.0? I would hope that anything
comparing filtering results (as I would guess this to be, knowing nothing of
it) would be using a reasonably recent version.
(Of course it would
As anecdotal evidence, it's my belief that people are seeing _alarm_ log
records and associated scan failures on both rc1 and rc2, and that they are
occurring with more than just Pyzor. This is anecdotal however, I don't have
any evidence to hand to support that.
I'm personally wondering if this
Well, user rules are always allowed when 'spamassassin' is run, so a --lint
message would have to say "if you plan on using spamd, your user rules won't
be used."
On the other hand, spamd when called with -Dconfig, will tell you it's not
parsing each of your user rules.
So... do we really want
Please let me know what you think!
Daryl and Chris both make a number of good points, but the buildbot idea
also seems to have a good deal of merit. A creative solution for the
'private corpus' problem that Chris mentions might help a lot though.
Unfortunately I don't have one at the moment,
You know, I don't know if there'd be a separate bugzilla. Good
question... I think the most likely thing would be that the rules
project stuff would be under the (existing) Rules component in BZ.
I don't know that BZ would get much use or be of much use in day to day
rules testing and
Some random comments:
So the idea is that the source code for all rules (apart from the legacy
core and lang sets) remains in the sandbox dirs; in other words, there's
no need to cut and paste and move around rules when they're promoted
from testing status, to live core status.
I'm not
Not too important, but the quip software is dumping SQL debug
info:
Maybe that depends on what you are doing. I tried to log in
unsuccessfully:
Software error:DBD::mysql::st execute failed: You have an error in your SQL syntax near '' at line 1 [for Statement "SELECT login_name FROM
Now that I can log in, I see why it isn't really important.
Loren
Hum. Is there any way to configure some default colors for the graph? On a
PC it seems Quicktime prints the thing out, and it is near unreadable. I
see a black square with a straight yellow line in the center and some wiggly
lines near the bottom. I *think* there might be some text in the
Not in my case Tom. I actually have all the Bayes features disabled and
the error still happened on my installation.
But do you have AWL disabled too?
I suppose mkrules could be changed to cat all the files parsed so far,
so that a sandbox file can refer to a core file's rule by name (since
sandbox will be compiled after core); but I quite like the side-effect of
restricting sandbox files to only being able to affect rules in their own
'As a collaborative documentation platform, the wiki has already proved
much more
effective than our SVN codebase.'
So why not write a routine to scrape the Wiki on the day of release and
stick the pages into files in the release tree?
Loren
Looks generally good. Minor comments:
1. Bob had a thing built into his version of mass-check that assigns default
scores. I'm not clear on the basis for this (although he has explained it
any number of times) but it is fairly simple and seems to do a decent job,
shy of a full scoring run.
I'm
Hello Warren,
There was also a recent discussion about using SVM scoring techniques, and
someone posted a tool to do that. I believe the claim was that it produced
reasonable scoring with less effort than the normal method. Perhaps that
could be used here?
Loren
Converting sections of tests into plugins where some people will want to
disable the entire set due to performance, memory, or similar
constraints (i.e., Bayes tests, network tests, special functionality,
etc.) does make sense. However, converting individual (or nearly
individual) tests that
Whether your idea is good or not, it has to do
with a suggestion for how to use sa-learn, not anything to do with
development.
Hi Sidney, happy new year!
Actually, while he phrased the RFE in terms of sa-learn, it is actually
something that could be done as an SA plugin, if SA were run on the
You can do that with the plain regex rules thanks to the experimental
and rather loony (?{...}) and (??{...}) constructs.
Well no. You could do that on 2.6x, and I used that for some very valuable
rule development tools. That ability was removed in 3.x.
Loren
anyway, I've just checked in a change that'll allow hit-rates
all the way down to 0.02%. why not. ;)
I guess I question active hitrates much under 1%. The key there is
'active'. Things that may be hitting next to nothing in one corpus might be
hitting well in another one.
Loren
IMO, bugs which allow any specially crafted spammy message to get
through, even if the method used is to crash spamd or stand-alone SA,
are NOT security bugs, provided the only damage is to SA/spamd and the
resulting FN. That's a bug, pure and simple, no matter how creative
the spammer is.
At a guess: IE and apparently Firefox have search for url enabled by
default. In IE that consists of sticking .com, .net, etc suffixes on, and I
think trying a www. prefix. From a report on the user's list, it appears
that Firefox goes farther and will do a google search, resulting in a
tinyurl
As an outsider, I find myself strongly agreeing with Motohraru-san that,
when dealing with at least the oriental multibyte languages, tokenization
belongs early in the stream, before both bayes and rules.
Of course this is an overhead penalty that should not occur on mail that
isn't likely to be
in other words it's been dropping from a high of 19.348% of spam to just
0.38% nowadays.
Which isn't to say that there aren't unique ids in modern subjects. They
just aren't in a form this can detect. :-)
Loren
default_rules_path (/usr/share/spamassassin)
site_rules_path (/etc/mail/spamassassin)
default_userprefs_path (~/.spamassassin/user_prefs)
Doesn't that imply that site rules override local rules? Surely those are
in the other order? Or is there magic when reading the second file
Should we be wrapping full rules in alarms (using M::SA::Timeout) to
prevent this?
You can do this with any rule; a full rule is just easier to mess up.
I'd be concerned about the overhead (and probable timing holes) in wrapping
every rule in an alarm().
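For comparison, here is roughly what a per-rule alarm wrapper looks like, sketched in Python with SIGALRM (Unix-only; function and exception names are hypothetical). It also shows where the overhead comes from: two signal() calls and two alarm() calls for every rule run.

```python
import signal

class RuleTimeout(Exception):
    pass

def run_with_timeout(rule_fn, seconds=2):
    # Arm an alarm, run the rule, and always cancel the alarm and
    # restore the previous handler, even if the rule raises.
    def _handler(signum, frame):
        raise RuleTimeout('rule exceeded %ds' % seconds)
    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return rule_fn()
    finally:
        signal.alarm(0)                      # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)
```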
As an alternative, how about wrapping
One big plugin would be better than the current split. The current
split has no solid technical rationale behind it.
- allows eval rules to not be loaded. Arguably, most of them will always
be enabled, but some could be disabled. DNSEval, for instance, is only
useful in net mode. If
of Bayes and URIBL. There would probably be a much lower-overhead solution,
say SpamBayes, if SA's rules capability is effectively removed. Which seems
to be the effective intent of this proposal.
Loren Wilton
If it's a plugin, it has to be a code-tied rule! Otherwise it wouldn't
need the plugin.
Hey, what a neat way to completely disable the initial concept of the Rules
project and put things back into the Land Of Arcana where they belong!
Just move 'body', 'rawbody', 'header', and 'full' to