Stevan Bajić wrote:
> Hello all,
>
> I am in the middle of doing quality assurance for DSPAM 3.9.0 and have
> found some issues that I have addressed but not yet committed to Git.
>
> One such issue is the way the new HTML stripper works. I made it too
> well :). It removes HTML tags so thoroughly that a message can
> degenerate to a low count of text that can be tokenized. I addressed
> that issue by trying even harder to find data in the removed HTML
> tokens that could potentially be usable for the tokenizer (for example:
> http/https/ftp URLs in various tags). This helps to lower the failures
> on the test corpus I am using (2005 TREC Public Spam Corpus).
>
> What I can tell now is that the current code in DSPAM 3.9.0 BETA4
> results in the following for the TREC05 full corpus (using OSB
> tokenizer, bcr for PValue and graham+burton for Algorithm):
>   Total messages in full index: 92'189
>   Total messages resulting in broken class: 2'905
>
> This is 3.2% failures on all the messages found in the corpus. I don't
> find that number very low. Running the test again with the modified
> code results in this:
>   Total messages in full index: 92'189
>   Total messages resulting in broken class: 155
>
> That is now just 0.17% failures on all the messages. Almost 20 times
> fewer than before. Not bad. Failures btw don't mean that something must
> be wrong with DSPAM. I have a certain message size limit set in DSPAM,
> and if that limit is reached the message gets delivered without any
> output. (I think DSPAM should deliver some output at least if deliver
> is set to summary and/or maybe when using --stdout. Maybe a new output
> like: result="OverMaxMessageSize"; class="Uncertain";
> probability=1.0000; confidence=1.00;).
>
> However... when looking at some messages that produce errors, I see
> that often the messages themselves are not valid. For example:
>   * data/002/010
>   * data/091/021
>
> They are just empty messages without any body part. They are totally
> broken. Normally an empty line separates the header from the body, but
> those mails have no body part. Not even an empty one. And libdspam is
> bailing on them (which is IMHO the right thing to do).
>
> I don't like failures. Not in that context. While thinking about how to
> prevent that error I remembered the option DataSource in DSPAM. I could
> try to use that mechanism to switch DSPAM into processing the whole
> message as one single body. This could allow DSPAM to still tokenize a
> message that is not a valid email message, instead of bailing with
> EINVAL (error code 22).
>
> What do you guys think? What would you expect from DSPAM in that case?
> From my personal viewpoint I would say that a corrupt message is a
> corrupt message and that's it. DSPAM does not just fail without a
> reason. But from a production/training viewpoint I would say that
> allowing the DSPAM operator to choose how to handle such a case could
> be a nice thing. What is your opinion?

Hello, Stevan.
I would say that invalid messages such as those you described should be
put into quarantine and tagged as 'Invalid' or 'Corrupt', so the user can
decide to receive them later. In fact, this is how DSPAM handles viruses,
right? No tokenizing, just put in quarantine and tagged.

What would be the advantage of tokenizing such corrupt messages?

Best Regards,
Carlo Rodrigues

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
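
The two options discussed in this thread (quarantine a message that has no header/body separator, or treat the whole input as one body and tokenize it anyway) can be put side by side in a small sketch. This is only an illustration in Python, not libdspam code: the function names, the `fallback_to_body` switch, and the 'Corrupt' tag are hypothetical and are not taken from DSPAM.

```python
import re

# Hypothetical helper names; nothing here is part of DSPAM or libdspam.

def split_message(raw: bytes):
    """Split a raw message at the first empty line.

    Returns (headers, body). body is None when there is no empty line
    at all, i.e. the message has no body part, "not even an empty one".
    """
    match = re.search(rb"\r?\n\r?\n", raw)
    if match is None:
        return raw, None
    return raw[:match.start()], raw[match.end():]


def handle(raw: bytes, fallback_to_body: bool = False):
    """Decide what to do with a message (illustrative policy only).

    - Valid message: tokenize the body.
    - No separator, fallback enabled: tokenize the whole input as one
      single body instead of failing.
    - No separator, fallback disabled: quarantine and tag as 'Corrupt'.
    """
    _headers, body = split_message(raw)
    if body is not None:
        return ("tokenize", body)
    if fallback_to_body:
        return ("tokenize", raw)
    return ("quarantine", "Corrupt")
```

With `fallback_to_body=False` the sketch follows the quarantine suggestion; with `fallback_to_body=True` it follows the "process the whole message as one single body" idea, leaving the choice to the operator.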