On Fri, 09 Jul 2010 16:39:30 +0100, Paul Cockings <ds...@cytringan.co.uk>
wrote:
> I've seen an increase in the number of Subject fields with text like
> this:
> 
> Extraordinary_Special_Deals_Available_on_Onshore
> June's_WELL_RED:_Viva_Las_Vegas!
> 
> How do subjects like this work with the Dspam tokenisers?  I'm assuming
> we would see it as one word.  Is there a case to think about using
> the underscore as a delimiter....
> 
> (just a thought)
> 
DSPAM uses " .,;:\n\...@-+*" as delimiters for body and " ,;:\n\...@-+*"
as delimiters for headers. See here:
http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=blob;f=src/config.h;hb=HEAD#l56
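
To make the effect on the quoted subjects concrete, here is a minimal
sketch of delimiter-based splitting. It is not DSPAM code: split_on and
HEADER_DELIMS are hypothetical names, and the delimiter set contains only
the characters legible in the string above (the elided part is left out).

```python
import re

# Assumption: only the header delimiter characters that are legible in the
# message are used; the elided part of the string is deliberately omitted.
HEADER_DELIMS = " ,;:\n@-+*"


def split_on(text, delims):
    """Split text on any run of characters in delims, dropping empty pieces."""
    pattern = "[" + re.escape(delims) + "]+"
    return [piece for piece in re.split(pattern, text) if piece]


subject = "Extraordinary_Special_Deals_Available_on_Onshore"

# Without "_" in the delimiter set the whole subject stays one token.
print(split_on(subject, HEADER_DELIMS))
# ['Extraordinary_Special_Deals_Available_on_Onshore']

# Adding "_" (the change Paul suggests) yields one token per word.
print(split_on(subject, HEADER_DELIMS + "_"))
# ['Extraordinary', 'Special', 'Deals', 'Available', 'on', 'Onshore']
```

The second subject behaves slightly differently because ":" is already a
delimiter: it splits into "June's_WELL_RED" and "_Viva_Las_Vegas!".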

John changed those delimiters on 2006-01-24, claiming that the reduced
number of delimiters has led to higher accuracy:
http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=blob;f=CHANGELOG;hb=HEAD#l1570

If it were up to me, I would make those delimiters configurable in
dspam.conf rather than hard-wiring them into config.h. But then again... I
see so many n00bs using DSPAM, most of whom would be completely overwhelmed
by such a technical configuration option, that I fear exposing this kind of
thing in dspam.conf would create more problems than it solves.

In any case, DSPAM would drop that long subject and not tokenize it. If
memory serves, anything longer than 50 characters is NOT tokenized but
simply ignored. So that long subject line is treated as if there were no
subject at all.
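
A length filter of that kind can be sketched as follows. This is an
illustration only: MAX_TOKEN_LEN and keep_token are hypothetical names, and
the limit of 50 is the value recalled above, not one checked against the
DSPAM source.

```python
# Assumption: 50 is the limit recalled from memory in the message, and
# "longer than 50" is read as len(token) > 50. Neither is verified here.
MAX_TOKEN_LEN = 50


def keep_token(token, limit=MAX_TOKEN_LEN):
    """Return True if the token is short enough to be kept."""
    return len(token) <= limit


first_subject = "Extraordinary_Special_Deals_Available_on_Onshore"

print(len(first_subject))         # 48
print(keep_token(first_subject))  # True
print(keep_token("x" * 51))       # False
```

Counted this way the first example subject is 48 characters, just under the
recalled limit, so whether it is really dropped depends on the exact limit
and comparison used in the source, which this sketch does not verify.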

On my CRM114 installation I am using this REGEXP for delimiters:
#   ---- What regex do we use for LEARN/CLASSIFY?  the first is the
#   ---- "old standard".  Other ones are handy for different spam
#   ---- mixes.  The last one is for people who get a great deal of
#   ---- packed HTML spam, which is almost everybody in 2003, so it
#   ---- used to be the default.  But since spammers have shifted away
#   ---- from this, it isn't the default any longer.  IF you change
#   ---- this, you MUST rebuild your .css files with decent
#   ---- amounts of locally-grown spam and nonspam ( if you've been
#   ---- following instructions and using the "reaver" cache, this is
#   ---- easily done! )
#
:lcr: /[[:graph:]]+/
#:lcr: /[[:alnum:]]+/
#:lcr: /[-.,:[:alnum:]]+/
#:lcr: /[[:graph:]][-[:alnum:]]*[[:graph:]]?/
#:lcr: /[[:graph:]][-.,:[:alnum:]]*[[:graph:]]?/
#
#  this next one is pretty incomprehensible, and probably wrong...
#:lcr: /[[:print:]][/!?\#]?[-[[:alnum:]][[:punct:]]]*(?:[*'=;]|/?>|:/*)?


However... DSPAM does not include a REGEXP engine, so we cannot just go
and use [[:graph:]]+ to define the tokens.
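
For comparison, here is a rough sketch of what regex-based token matching
does with one of the quoted subjects. It is not CRM114 code: Python's re
module has no POSIX bracket classes, so GRAPH and ALNUM below are ASCII-only
stand-ins (an assumption) for [[:graph:]]+ and [[:alnum:]]+.

```python
import re

# Assumption: ASCII-only approximations of the POSIX classes used in the
# CRM114 config, since Python's re does not support [[:graph:]] syntax.
GRAPH = re.compile(r"[!-~]+")        # roughly [[:graph:]]+
ALNUM = re.compile(r"[A-Za-z0-9]+")  # roughly [[:alnum:]]+

subject = "June's_WELL_RED:_Viva_Las_Vegas!"

# Every visible character belongs to a token, so the subject stays whole.
print(GRAPH.findall(subject))
# ["June's_WELL_RED:_Viva_Las_Vegas!"]

# Only letters and digits form tokens, so "_", "'", ":" and "!" all split.
print(ALNUM.findall(subject))
# ['June', 's', 'WELL', 'RED', 'Viva', 'Las', 'Vegas']
```

With a token pattern the choice of character class decides what counts as a
word, whereas a delimiter list has to name every separating character.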

-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
