OK, it's good to write some documentation; things are a bit clearer to me
now. :-)

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/WritingRulesAdvanced

My bad, __BODY_STARTS_WITH_FROM_LINE was of course fine.  'body' matches
against paragraphs normalized to single lines, so there are no newlines
(except at the end).  So /m is always useless for body!
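
To illustrate why /m changes nothing for body, here is a minimal sketch in
Python.  Python's re module is only a stand-in for the Perl regex engine
(re.M plays the role of /m), and the sample paragraph is made up.

```python
import re

# Assumption: a body paragraph as a rule would see it, already folded into
# a single line with no embedded newlines.
paragraph = "From someone Mon Jun 19 14:15:23 2006 Header2: blah"

pattern = r"^From \S+"

# With no newlines inside the text, ^ can only match at position 0,
# so the multiline flag cannot add any matches.
plain = re.findall(pattern, paragraph)
multiline = re.findall(pattern, paragraph, re.M)

print(plain == multiline)
```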

Rawbody contains raw lines.  /m was needed for __HAS_IMG_SRC etc. since they
use tflags multiple to count things.  Without /m, only one hit was counted
per chunk of a few kB.
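
The counting difference can be sketched like this, again with Python's re
standing in for Perl (re.M for /m, re.I for /i).  The chunk content is an
invented example, and re.findall only approximates what tflags multiple
does.

```python
import re

# Assumption: one rawbody chunk holding several raw lines; the second
# line is a quoted line and should not be counted.
chunk = (
    '<p><img src="a.png"></p>\n'
    '> <img src="quoted.png">\n'
    '<div><IMG SRC="b.png"></div>\n'
)

pattern = r"^[^>].*?<img src="

# Without the multiline flag, ^ only matches at the start of the chunk.
without_m = re.findall(pattern, chunk, re.I)
# With it, ^ matches at the start of every line in the chunk.
with_m = re.findall(pattern, chunk, re.I | re.M)

print(len(without_m), len(with_m))
```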

TVD_DEAR_HOMEOWNER etc. match at the start of a paragraph.  It would be more
robust written as /^\s?dear homeowner/i, since leading whitespace is not
trimmed.  Rules like DEAR_EMAIL and DEAR_EMAIL_USER take this into account,
good. :-)
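
A quick sketch of the leading-space problem, using Python's re as a
stand-in for Perl.  The two sample paragraphs are invented.

```python
import re

# Assumption: the same paragraph with and without a leading space that
# was not trimmed.
paragraphs = [
    "Dear homeowner, refinance today",
    " Dear homeowner, refinance today",
]

strict = re.compile(r"^dear homeowner", re.I)
tolerant = re.compile(r"^\s?dear homeowner", re.I)

strict_hits = [bool(strict.search(p)) for p in paragraphs]
tolerant_hits = [bool(tolerant.search(p)) for p in paragraphs]

print(strict_hits)    # the paragraph with the leading space is missed
print(tolerant_hits)  # both paragraphs hit
```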

TVD_LONG_WORD5: I don't know what this is supposed to do.  A single
paragraph consisting of "foo ." is enough for it to hit.  Probably remove
this, as its ruleqa results are bad.
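
For reference, this is roughly how the TVD_LONG_WORD5 regex behaves on a
few made-up paragraphs (Python's re used as a stand-in for Perl):

```python
import re

# The TVD_LONG_WORD5 pattern as quoted in the rule file.
long_word5 = re.compile(r"^(?:(?:\w+,?\s+)\.)+\s*$")

# Assumption: invented sample paragraphs.
samples = {
    "foo .": bool(long_word5.search("foo .")),
    "foo, .bar .": bool(long_word5.search("foo, .bar .")),
    "foo.": bool(long_word5.search("foo.")),
    "This is a normal sentence.": bool(
        long_word5.search("This is a normal sentence.")
    ),
}

for text, hit in samples.items():
    print(repr(text), hit)
```

Only paragraphs made of "word, whitespace, dot" groups hit, which says
nothing about long words.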

__TWO_WORD_LINES does not work as its name suggests, as it tries to match
complete rawbody chunks.  It only matches when the complete MIME part
contains just two words.
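
A small sketch of that behaviour, with Python's re standing in for Perl
(re.M for /m) and invented chunk contents:

```python
import re

two_word_lines = r"^\S+\s+\S+$"

# Assumption: a rawbody chunk with one two-word line followed by more text.
multi_line_chunk = "hello world\nthis is a longer line\n"
# Assumption: a chunk (whole MIME part) containing only two words.
two_word_chunk = "hello world\n"

# Without /m the pattern has to span the whole chunk.
whole_chunk_hit = bool(re.search(two_word_lines, multi_line_chunk))
only_two_words_hit = bool(re.search(two_word_lines, two_word_chunk))
# With /m it can match an individual line inside the chunk.
per_line_hit = bool(re.search(two_word_lines, multi_line_chunk, re.M))

print(whole_chunk_hit, only_two_words_hit, per_line_hit)
```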

IMG_ALT_BRACKETS, dunno why it's anchored.

The concept of body paragraphs is somewhat troubling, since DEAR_EMAIL will
not match this for example:

-------------------
Dear foo,

Spam me at foo@bar
-------------------

The only solution for that would be making separate subrules to match Dear
and the email address etc., but then the rule can't see how "near" the
matches are to each other.
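
To make the limitation concrete, here is a sketch with Python's re as a
stand-in for Perl.  The two subrule patterns and the subrules_hit helper
are hypothetical, only meant to mimic combining two per-paragraph subrules.

```python
import re

# DEAR_EMAIL as quoted in the rule file.
dear_email = re.compile(r"^\s*Dear\b.{0,70}\w\@\w", re.I)

# Hypothetical subrules, one per half of the pattern.
dear = re.compile(r"^\s*Dear\b", re.I)
email = re.compile(r"\w@\w")

# Assumption: body paragraphs as separate strings.
near = ["Dear foo,", "Spam me at foo@bar"]
far = ["Dear foo,"] + ["filler text"] * 50 + ["Contact foo@bar"]

def subrules_hit(paragraphs):
    """True if each subrule hits somewhere, regardless of distance."""
    return (any(dear.search(p) for p in paragraphs)
            and any(email.search(p) for p in paragraphs))

single_rule_near = any(dear_email.search(p) for p in near)
combined_near = subrules_hit(near)
combined_far = subrules_hit(far)

print(single_rule_near, combined_near, combined_far)
```

The single rule misses the split message, while the combined subrules hit
both the adjacent and the far-apart case without telling them apart.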

So we should really implement some new rule types or tflags for 4.0.0, like
fullbody and fullrawbody, where the text is not split into chunks in any way.
It's really not a problem to match 50-500k blobs (*_part_size_limit) these
days.  Of course they would only be used when necessary, like "full" rules
already are.
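
As a rough idea of what matching unsplit text would allow, here is a
sketch in Python (a stand-in for Perl; re.S plays the role of /s).
fullbody does not exist at this point, so the joined text below is only an
assumption about what such a rule type might be given.

```python
import re

# DEAR_EMAIL's pattern, with "." allowed to cross newlines (re.S) so the
# 70-character window can span paragraph breaks.
dear_email_full = re.compile(r"^\s*Dear\b.{0,70}\w\@\w", re.I | re.S)

# Assumption: the whole text in one piece, paragraphs separated by blank
# lines.
near_text = "Dear foo,\n\nSpam me at foo@bar"
far_text = "Dear foo,\n\n" + "filler text\n\n" * 50 + "Contact foo@bar"

near_hit = bool(dear_email_full.search(near_text))
far_hit = bool(dear_email_full.search(far_text))

print(near_hit, far_hit)
```

Here the distance limit in the regex itself decides how "near" the two
parts have to be.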


On Wed, Aug 07, 2019 at 03:48:00PM +0300, Henrik K wrote:
> 
> Guys, remember that if you use /^foo/ with body or rawbody, you most likely
> need to use /m !!!
> 
> Fixed few blatant ones, theres many more, atleast check these out
> 
> rulesrc/sandbox/felicity/70_other.cf:body TVD_DEAR_HOMEOWNER         /^dear 
> homeowner/i
> rulesrc/sandbox/felicity/70_other.cf:body TVD_LONG_WORD5             
> /^(?:(?:\w+,?\s+)\.)+\s*$/
> rulesrc/sandbox/felicity/70_other.cf:body TVD_SPACED_WORDS   
> /^(?:[A-Z]\s)+[a-z]\s(?:[A-Z]\s)+$/
> rulesrc/sandbox/jm/20_basic.cf:body __BODY_STARTS_WITH_FROM_LINE /^From \S+ 
> \S\S\S \S\S\S .. ..:..:.. \S+\s+\S+\: /s
> rulesrc/sandbox/jm/20_basic.cf:rawbody __HS_QUOTE /^> /
> rulesrc/sandbox/fanf/30_text.cf:rawbody IMG_ALT_BRACKETS /^<img 
> src="cid:7\.1\.0\.9\.[^"]+\.0" width=\d+ height=\d+ alt="[[][]]">/
> rulesrc/sandbox/khopesh/20_khop_general.cf:body  DEAR_EMAIL 
> /^\s*Dear\b.{0,70}\w\@\w/i
> rulesrc/sandbox/jhardin/20_MIME_in_body.cf: body        __MIME_CTYPE_IN_BODY  
>   /^Content-Type:\s/
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body           DEAR_EMAIL_USER     
>      /^\s?(?:Dear\s|Attention:?\s?)(?:E|Web)-?mail\s(?:account\s)?User\b/i
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body        __FBI_BODY_SHOUT_1   
> /^FEDERAL BUREAU OF INVESTIGATIONS?\b/
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body        __BODY_TEXT_LINE     
> /^\s*\S/
> rulesrc/sandbox/jhardin/20_misc_testing.cf:body        __SINGLE_WORD_LINE  
> /^\s?\S{1,60}\s?$/
> rulesrc/sandbox/dos/70_other.cf:body __DOS_HI                   /^Hi,$/
> rulesrc/sandbox/maddoc/99_doc_test.cf:rawbody __TWO_WORD_LINES /^\S+\s+\S+$/
> rulesrc/sandbox/maddoc/99_fsl_testing.cf:rawbody  FSL_BOTSPAM_1   
> /^[^\n]+\nhttp:\/\/[^\n]+\.ru\/\n$/s
> 
> Remember that body always has a subject line as first line, many of these
> are basically trying to match only subject...
> 
> 
> 
> On Wed, Aug 07, 2019 at 12:45:49PM -0000, [email protected] wrote:
> > Author: hege
> > Date: Wed Aug  7 12:45:49 2019
> > New Revision: 1864618
> > 
> > URL: http://svn.apache.org/viewvc?rev=1864618&view=rev
> > Log:
> > Fix some missing regex /m
> > 
> > Modified:
> >     spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf
> >     spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf
> > 
> > Modified: spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf
> > URL: 
> > http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf?rev=1864618&r1=1864617&r2=1864618&view=diff
> > ==============================================================================
> > --- spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf (original)
> > +++ spamassassin/trunk/rulesrc/sandbox/billcole/80_test.cf Wed Aug  7 
> > 12:45:49 2019
> > @@ -1,19 +1,19 @@
> >  # These are oddities seen in Other People's Spam, i.e. I have no hits in 
> > my test corpora 
> >  
> >  describe   __HAS_IMG_SRC   Has an img tag on a non-quoted line
> > -rawbody            __HAS_IMG_SRC   /^[^>].*?<img src=/i
> > +rawbody            __HAS_IMG_SRC   /^[^>].*?<img src=/im
> >  tflags             __HAS_IMG_SRC   multiple maxhits=100
> >  
> >  describe   __HAS_HREF      Has an anchor tag with a href attribute in 
> > non-quoted line
> > -rawbody            __HAS_HREF      /^[^>].*?<a href=/i
> > +rawbody            __HAS_HREF      /^[^>].*?<a href=/im
> >  tflags             __HAS_HREF      multiple maxhits=100
> >  
> >  describe   __HAS_IMG_SRC_ONECASE   Has an img tag on a non-quoted line 
> > with consistent case
> > -rawbody            __HAS_IMG_SRC_ONECASE   /^[^>].*?<(img src|IMG SRC)=/
> > +rawbody            __HAS_IMG_SRC_ONECASE   /^[^>].*?<(img src|IMG SRC)=/m
> >  tflags             __HAS_IMG_SRC_ONECASE   multiple maxhits=100
> >  
> >  describe   __HAS_HREF_ONECASE      Has an anchor tag with a href attribute 
> > in non-quoted line with consistent case
> > -rawbody            __HAS_HREF_ONECASE      /^[^>].*?<(a href|A HREF)=/
> > +rawbody            __HAS_HREF_ONECASE      /^[^>].*?<(a href|A HREF)=/m
> >  tflags             __HAS_HREF_ONECASE      multiple maxhits=100
> >  
> >  describe   __MIXED_IMG_CASE        Has img tags with mixed-up cases in 
> > non-quoted lines
> > 
> > Modified: spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf
> > URL: 
> > http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf?rev=1864618&r1=1864617&r2=1864618&view=diff
> > ==============================================================================
> > --- spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf (original)
> > +++ spamassassin/trunk/rulesrc/sandbox/jm/20_basic.cf Wed Aug  7 12:45:49 
> > 2019
> > @@ -31,7 +31,7 @@ header JM_RCVD_QMAILV1     Received =~ /
> >  #   From [email protected]  Mon Jun 19 14:15:23 2006
> >  #   Header2: blah
> >  
> > -body __BODY_STARTS_WITH_FROM_LINE /^From \S+ \S\S\S \S\S\S .. ..:..:.. 
> > \S+\s+\S+\: /s
> > +body __BODY_STARTS_WITH_FROM_LINE /^(?:[^\n]*\n)?From \S+ \S\S\S \S\S\S .. 
> > ..:..:.. \S+\s+\S+\: /s
> >  meta CORRUPT_FROM_LINE_IN_HDRS (MISSING_HEADERS && 
> > __BODY_STARTS_WITH_FROM_LINE && MISSING_DATE && NO_RELAYS)
> >  describe CORRUPT_FROM_LINE_IN_HDRS Informational: message is corrupt, with 
> > a From line in its headers
> >  
> > 
