Hello all,

I am in the middle of doing quality assurance for DSPAM 3.9.0 and have found 
some issues that I have addressed but not yet committed to Git.

One such issue is the way the new HTML stripper works. I made it work too 
well :). It removes HTML tags so thoroughly that a message can degenerate to 
the point where very little text is left to tokenize. I addressed that by 
trying even harder to find data in the removed HTML tags that could be usable 
by the tokenizer (for example: http/https/ftp URLs in various tags). This 
lowers the number of failures on the test corpus I am using (2005 TREC Public 
Spam Corpus).
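
To illustrate the idea only: the following is a minimal Python sketch, not the 
C code in DSPAM. The class name StripKeepingUrls and the function strip_html 
are made up for this example. It drops the tags but keeps the text plus any 
attribute value that starts with http, https or ftp.

```python
from html.parser import HTMLParser
import re

# Attribute values that look like http/https/ftp URLs are worth keeping.
URL_RE = re.compile(r"^(?:https?|ftp)://", re.IGNORECASE)


class StripKeepingUrls(HTMLParser):
    """Hypothetical stripper: drops tags, keeps text and URLs from attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Look at every attribute (href, src, background, ...), not only href.
        for _name, value in attrs:
            if value and URL_RE.match(value.strip()):
                self.parts.append(value.strip())

    def handle_data(self, data):
        # Keep visible text, ignoring whitespace-only runs.
        if data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the tokenizable remainder of an HTML fragment as one string."""
    parser = StripKeepingUrls()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

An image-only message such as 
<a href="http://example.com/buy"><img src="https://example.com/x.gif"></a> 
would otherwise strip down to nothing; with this approach the two URLs remain 
available to the tokenizer.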

What I can tell now is that the current code in DSPAM 3.9.0 BETA4 results in 
the following for the TREC05 full corpus (using OSB tokenizer, bcr for PValue 
and graham+burton for Algorithm):
Total messages in full index: 92'189
Total messages resulting in broken class: 2'905

That is a failure rate of 3.2% over all the messages in the corpus, which I 
find rather high. Running the test again with the modified code gives:
Total messages in full index: 92'189
Total messages resulting in broken class: 155

That is now just 0.17% failures over all the messages, almost 20 times fewer 
than before. Not bad. Failures, by the way, don't mean that something must be 
wrong with DSPAM. I have a certain message size limit set in DSPAM, and if that 
limit is reached the message gets delivered without any output. (I think DSPAM 
should deliver some output at least if deliver is set to summary and/or maybe 
when using --stdout. Maybe a new output like: result="OverMaxMessageSize"; 
class="Uncertain"; probability=1.0000; confidence=1.00;).
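
For anyone who wants to check the percentages above, here is the arithmetic as 
a small Python snippet. The counts are the ones from the two runs; the 
function name failure_rate is just for this example.

```python
TOTAL = 92_189   # messages in the full index
BEFORE = 2_905   # broken class with 3.9.0 BETA4
AFTER = 155      # broken class with the modified code


def failure_rate(failures, total):
    """Return the failure rate as a percentage of all messages."""
    return 100.0 * failures / total


print(f"before: {failure_rate(BEFORE, TOTAL):.2f}%")  # before: 3.15%
print(f"after:  {failure_rate(AFTER, TOTAL):.2f}%")   # after:  0.17%
print(f"ratio:  {BEFORE / AFTER:.1f}x")               # ratio:  18.7x
```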

However... when looking at some of the messages that produce errors, I see that 
often the messages themselves are not valid. For example:
 * data/002/010
 * data/091/021

They are just empty messages without any body part. They are totally broken. 
Normally an empty line separates the header from the body, but those mails have 
no body part, not even an empty one. And libdspam bails out on them (which is 
IMHO the right thing to do).

I don't like failures. Not in that context. While thinking about how to prevent 
that error I remembered the option DataSource in DSPAM. I could try to use that 
mechanism to switch DSPAM into processing the whole message as one single body. 
This could allow DSPAM to still tokenize a message that is not a valid email 
message, instead of bailing with EINVAL (error code 22).
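
Roughly what I have in mind, as a Python sketch rather than libdspam code: the 
function split_message and its fallback_to_body switch are hypothetical names 
for this example, and how the switch would actually be wired to a DSPAM option 
is still open.

```python
import errno


def split_message(raw, fallback_to_body=False):
    """Split a raw message into (headers, body) at the first empty line.

    Without a separator the message is invalid: either fail with EINVAL
    or, if the operator asked for it, treat everything as one body.
    """
    normalized = raw.replace("\r\n", "\n")
    head, sep, body = normalized.partition("\n\n")
    if sep:
        return head, body
    if fallback_to_body:
        # Hypothetical operator choice: tokenize the whole input as body.
        return "", normalized
    raise OSError(errno.EINVAL, "no header/body separator found")
```

With fallback_to_body left at False a header-only mail fails as it does now; 
with it set to True the same input comes back as a single body and can still 
be tokenized.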

What do you guys think? What would you expect from DSPAM in that case? From my 
personal viewpoint I would say that a corrupt message is a corrupt message and 
that's it. DSPAM does not just fail without a reason. But from a 
production/training viewpoint I would say that allowing the DSPAM operator to 
choose how to handle such a case could be a nice thing. What is your opinion?


-- 
Kind Regards from Switzerland,

Stevan Bajić
