On Fri, 09 Jul 2010 16:39:30 +0100, Paul Cockings <ds...@cytringan.co.uk> wrote:

> I've seen an increase in the number of Subject fields with text like this:
>
> Extraordinary_Special_Deals_Available_on_Onshore
> June's_WELL_RED:_Viva_Las_Vegas!
>
> How do subjects like this work with the Dspam tokenisers? I'm assuming
> we would see it as one word. Is there a case to think about using
> the underscore as a delimiter....
>
> (just a thought)
>

DSPAM uses " .,;:\n\...@-+*" as delimiters for the body and " ,;:\n\...@-+*" as delimiters for headers. See here:
http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=blob;f=src/config.h;hb=HEAD#l56

John changed those delimiters on 2006-01-24, claiming that the reduced number of delimiters has led to higher accuracy:
http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=blob;f=CHANGELOG;hb=HEAD#l1570

If it were up to me, I would allow those delimiters to be changed in dspam.conf instead of hard-wiring them into config.h. But then again... I see so many n00bs using DSPAM, and most of them would be way, way, way, way overstrained by such a technical configuration option, that I fear exposing this kind of stuff in dspam.conf would cause more problems than it solves in the real world.

DSPAM would drop that long subject anyway and not tokenize it. If memory does not fool me, anything longer than 50 characters is NOT tokenized but simply ignored. So that long subject line is treated as if there were no subject at all.

On my CRM114 installation I am using this REGEXP for delimiters:

# ---- What regex do we use for LEARN/CLASSIFY? the first is the
# ---- "old standard". Other ones are handy for different spam
# ---- mixes. The last one is for people who get a great deal of
# ---- packed HTML spam, which is almost everybody in 2003, so it
# ---- used to be the default. But since spammers have shifted away
# ---- from this, it isn't the default any longer. IF you change
# ---- this, you MUST rebuild your .css files with decent
# ---- amounts of locally-grown spam and nonspam ( if you've been
# ---- following instructions and using the "reaver" cache, this is
# ---- easily done! )
#
:lcr: /[[:graph:]]+/
#:lcr: /[[:alnum:]]+/
#:lcr: /[-.,:[:alnum:]]+/
#:lcr: /[[:graph:]][-[:alnum:]]*[[:graph:]]?/
#:lcr: /[[:graph:]][-.,:[:alnum:]]*[[:graph:]]?/
#
# this next one is pretty incomprehensible, and probably wrong...
#:lcr: /[[:print:]][/!?\#]?[-[[:alnum:]][[:punct:]]]*(?:[*'=;]|/?>|:/*)?

However... DSPAM has no REGEXP engine included, so we cannot simply use [[:graph:]]+ in place of the delimiter list.

-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
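
To make the delimiter question from this thread concrete, here is a minimal Python sketch of character-set tokenizing. It is not DSPAM's code. The names `HEADER_DELIMITERS`, `MAX_TOKEN_LEN` and `tokenize` are hypothetical, the delimiter set contains only the characters that are visible in the post (the part elided as "..." is left out rather than guessed), and the 50-character cutoff is the value remembered in the post, not one checked against the DSPAM source.

```python
import re

# Assumption: only the header delimiter characters visible in the post.
# The portion elided as "..." is deliberately omitted, not reconstructed.
HEADER_DELIMITERS = " ,;:\n@-+*"

# Assumption: cutoff as remembered in the post, not verified against DSPAM.
MAX_TOKEN_LEN = 50


def tokenize(text, delimiters=HEADER_DELIMITERS, max_len=MAX_TOKEN_LEN):
    """Split text on runs of delimiter characters; drop empty and over-long tokens."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok and len(tok) <= max_len]


if __name__ == "__main__":
    subjects = [
        "Extraordinary_Special_Deals_Available_on_Onshore",
        "June's_WELL_RED:_Viva_Las_Vegas!",
    ]
    for subject in subjects:
        # Without "_" in the set, underscores stay inside the tokens.
        print(tokenize(subject))
        # With "_" added, each word becomes its own token.
        print(tokenize(subject, HEADER_DELIMITERS + "_"))
```

With these assumptions the first subject stays a single token unless the underscore is added to the set, while the second is split only at the colon. Whether the extra word tokens would help or hurt accuracy is exactly the open question raised above; the sketch only shows the mechanical effect.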