OK, it's good to write some documentation; now it's a bit clearer to me. :-)

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/WritingRulesAdvanced

My bad, __BODY_STARTS_WITH_FROM_LINE was of course fine. 'body' matches
against paragraphs normalized to single lines, so there are no newlines
(except at the end). So /m is always useless for body!

Rawbody contains raw lines. /m was needed for __HAS_IMG_SRC etc. since they
use "tflags multiple" to count things. Without it, only one hit was counted
per chunk of a few kB.

TVD_DEAR_HOMEOWNER etc. match at the start of a paragraph. They would be more
robust written as /^\s?dear homeowner/i, since a leading space is not
trimmed. Rules like DEAR_EMAIL and DEAR_EMAIL_USER take this into account,
good. :-)

TVD_LONG_WORD5: I don't know what this is supposed to do. A single paragraph
containing "foo ." is enough to hit it. Probably remove this, as its ruleqa
results are bad.

__TWO_WORD_LINES does not work as its name suggests, as it's trying to match
complete rawbody chunks. It only matches when the complete MIME part contains
just two words.

IMG_ALT_BRACKETS: I don't know why it's anchored.

The concept of body paragraphs is somewhat troubling, since DEAR_EMAIL will
not match this, for example:

-------------------
Dear foo,

Spam me at foo@bar
-------------------

The only solution for that would be making separate subrules to match both
"Dear" and the email address, but then there is no way to see how "near" the
matches are to each other.

So we should really implement some new rule types or tflags for 4.0.0, like
fullbody and fullrawbody, where the text is not split into chunks in any way.
It's really not a problem to match 50-500k blobs (*_part_size_limit) these
days. Of course they would only be used when necessary, like "full" rules
already are.

On Wed, Aug 07, 2019 at 03:48:00PM +0300, Henrik K wrote:
>
> Guys, remember that if you use /^foo/ with body or rawbody, you most likely
> need to use /m !!!
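To make the /m point concrete, here is a rough sketch. It uses Python's re module as a stand-in for Perl regex (SpamAssassin itself is Perl), re.findall as an approximation of what "tflags multiple" counts, and sample strings that are made up for the illustration; count_hits is a hypothetical helper, not anything from SpamAssassin.

```python
import re

# Hypothetical sample data, made up for this illustration.
# A 'body' paragraph arrives as one line with no embedded newlines.
body_paragraph = "Dear homeowner, lower your rate today"
# A 'rawbody' chunk holds several raw lines in one string.
rawbody_chunk = (
    '<p><img src="a.png"></p>\n'
    '> <img src="quoted.png">\n'
    '<P><IMG SRC="b.png"></P>\n'
)

def count_hits(pattern, text, flags=0):
    """Count non-overlapping matches, roughly what 'tflags multiple' counts."""
    return len(re.findall(pattern, text, flags))

# Same regex as __HAS_IMG_SRC; re.I and re.M stand in for Perl's /i and /m.
HAS_IMG_SRC = r'^[^>].*?<img src='

# Without /m, ^ only matches at the very start of the chunk.
print(count_hits(HAS_IMG_SRC, rawbody_chunk, re.I))          # 1
# With /m, ^ matches at every line start; the quoted line is skipped.
print(count_hits(HAS_IMG_SRC, rawbody_chunk, re.I | re.M))   # 2

# A body paragraph has no inner newlines, so /m changes nothing there.
DEAR = r'^dear homeowner'
print(count_hits(DEAR, body_paragraph, re.I))                # 1
print(count_hits(DEAR, body_paragraph, re.I | re.M))         # 1

# A leading space is not trimmed, so the anchored rule needs \s? to survive it.
print(count_hits(DEAR, " " + body_paragraph, re.I))                   # 0
print(count_hits(r'^\s?dear homeowner', " " + body_paragraph, re.I))  # 1
```

So for rawbody the anchored rules need /m to see every line, while for body the only thing to watch for is the untrimmed leading space.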
>
> Fixed few blatant ones, theres many more, atleast check these out
>
> rulesrc/sandbox/felicity/70_other.cf:body TVD_DEAR_HOMEOWNER /^dear homeowner/i
> rulesrc/sandbox/felicity/70_other.cf:body TVD_LONG_WORD5 /^(?:(?:\w+,?\s+)\.)+\s*$/
> rulesrc/sandbox/felicity/70_other.cf:body TVD_SPACED_WORDS /^(?:[A-Z]\s)+[a-z]\s(?:[A-Z]\s)+$/
> rulesrc/sandbox/jm/20_basic.cf:body __BODY_STARTS_WITH_FROM_LINE /^From \S+ \S\S\S \S\S\S .. ..:..:.. \S+\s+\S+\: /s
> rulesrc/sandbox/jm/20_basic.cf:rawbody __HS_QUOTE /^> /
> rulesrc/sandbox/fanf/30_text.cf:rawbody IMG_ALT_BRACKETS /^<img src="cid:7\.1\.0\.9\.[^"]+\.0" width=\d+ height=\d+ alt="[[][]]">/
> rulesrc/sandbox/khopesh/20_khop_general.cf:body DEAR_EMAIL /^\s*Dear\b.{0,70}\w\@\w/i
> rulesrc/sandbox/jhardin/20_MIME_in_body.cf: body __MIME_CTYPE_IN_BODY /^Content-Type:\s/
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body DEAR_EMAIL_USER /^\s?(?:Dear\s|Attention:?\s?)(?:E|Web)-?mail\s(?:account\s)?User\b/i
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body __FBI_BODY_SHOUT_1 /^FEDERAL BUREAU OF INVESTIGATIONS?\b/
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body __BODY_TEXT_LINE /^\s*\S/
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body __SINGLE_WORD_LINE /^\s?\S{1,60}\s?$/
> rulesrc/sandbox/dos/70_other.cf:body __DOS_HI /^Hi,$/
> rulesrc/sandbox/maddoc/99_doc_test.cf:rawbody __TWO_WORD_LINES /^\S+\s+\S+$/
> rulesrc/sandbox/maddoc/99_fsl_testing.cf:rawbody FSL_BOTSPAM_1 /^[^\n]+\nhttp:\/\/[^\n]+\.ru\/\n$/s
>
> Remember that body always has a subject line as first line, many of these
> are basically trying to match only subject...
>
> On Wed, Aug 07, 2019 at 12:45:49PM -0000, [email protected] wrote:
> > Author: hege
> > Date: Wed Aug 7 12:45:49 2019
> > New Revision: 1864618
> >
> > URL: http://svn.apache.org/viewvc?rev=1864618&view=rev
> > Log:
> > Fix some missing regex /m
> >
> > Modified:
> >     spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf
> >     spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf
> >
> > Modified: spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf
> > URL: http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf?rev=1864618&r1=1864617&r2=1864618&view=diff
> > ==============================================================================
> > --- spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf (original)
> > +++ spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf Wed Aug 7 12:45:49 2019
> > @@ -1,19 +1,19 @@
> >  # These are oddities seen in Other People's Spam, i.e. I have no hits in my test corpora
> >
> >  describe __HAS_IMG_SRC Has an img tag on a non-quoted line
> > -rawbody __HAS_IMG_SRC /^[^>].*?<img src=/i
> > +rawbody __HAS_IMG_SRC /^[^>].*?<img src=/im
> >  tflags __HAS_IMG_SRC multiple maxhits=100
> >
> >  describe __HAS_HREF Has an anchor tag with a href attribute in non-quoted line
> > -rawbody __HAS_HREF /^[^>].*?<a href=/i
> > +rawbody __HAS_HREF /^[^>].*?<a href=/im
> >  tflags __HAS_HREF multiple maxhits=100
> >
> >  describe __HAS_IMG_SRC_ONECASE Has an img tag on a non-quoted line with consistent case
> > -rawbody __HAS_IMG_SRC_ONECASE /^[^>].*?<(img src|IMG SRC)=/
> > +rawbody __HAS_IMG_SRC_ONECASE /^[^>].*?<(img src|IMG SRC)=/m
> >  tflags __HAS_IMG_SRC_ONECASE multiple maxhits=100
> >
> >  describe __HAS_HREF_ONECASE Has an anchor tag with a href attribute in non-quoted line with consistent case
> > -rawbody __HAS_HREF_ONECASE /^[^>].*?<(a href|A HREF)=/
> > +rawbody __HAS_HREF_ONECASE /^[^>].*?<(a href|A HREF)=/m
> >  tflags __HAS_HREF_ONECASE multiple maxhits=100
> >
> >  describe __MIXED_IMG_CASE Has img tags with mixed-up cases in non-quoted lines
> >
> > Modified: spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf
> > URL: http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf?rev=1864618&r1=1864617&r2=1864618&view=diff
> > ==============================================================================
> > --- spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf (original)
> > +++ spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf Wed Aug 7 12:45:49 2019
> > @@ -31,7 +31,7 @@ header JM_RCVD_QMAILV1 Received =~ /
> >  # From [email protected] Mon Jun 19 14:15:23 2006
> >  # Header2: blah
> >
> > -body __BODY_STARTS_WITH_FROM_LINE /^From \S+ \S\S\S \S\S\S .. ..:..:.. \S+\s+\S+\: /s
> > +body __BODY_STARTS_WITH_FROM_LINE /^(?:[^\n]*\n)?From \S+ \S\S\S \S\S\S .. ..:..:.. \S+\s+\S+\: /s
> >  meta CORRUPT_FROM_LINE_IN_HDRS (MISSING_HEADERS && __BODY_STARTS_WITH_FROM_LINE && MISSING_DATE && NO_RELAYS)
> >  describe CORRUPT_FROM_LINE_IN_HDRS Informational: message is corrupt, with a From line in its headers
> >
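
PS. Coming back to the DEAR_EMAIL paragraph problem from the top of this mail, a rough sketch of why unsplit matching would help. Again this is Python's re standing in for Perl regex. The paragraph splitting below is only a crude approximation of what 'body' does, and "fullbody" does not exist yet; it is just the idea proposed above, simulated by matching the whole text with the equivalents of /m and /s added.

```python
import re

# DEAR_EMAIL regex as quoted in this thread (re.I stands in for Perl's /i).
DEAR_EMAIL = re.compile(r'^\s*Dear\b.{0,70}\w\@\w', re.I)

# The example message from above: greeting and address in separate paragraphs.
message = "Dear foo,\n\nSpam me at foo@bar\n"

# Crude approximation of 'body' processing (an assumption, not the real code):
# split on blank lines and collapse each paragraph to a single line.
paragraphs = [
    " ".join(p.split())
    for p in re.split(r'\n\s*\n', message)
    if p.strip()
]

# Each paragraph is matched on its own, so neither one contains both parts.
hit_per_paragraph = any(DEAR_EMAIL.search(p) for p in paragraphs)

# Hypothetical "fullbody": the same regex against the unsplit text, with the
# equivalents of /m (^ at each line start) and /s (dot matches newline) so
# the match can span the paragraph break.
DEAR_EMAIL_FULL = re.compile(r'^\s*Dear\b.{0,70}\w\@\w', re.I | re.M | re.S)
hit_full_text = DEAR_EMAIL_FULL.search(message) is not None

print(paragraphs)          # ['Dear foo,', 'Spam me at foo@bar']
print(hit_per_paragraph)   # False
print(hit_full_text)       # True
```

The {0,70} window is what expresses "near" here, which separate per-paragraph subrules cannot do.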
