Hello all,

I have done the same test for TREC06 with the following results:

Messages over 4MB (my current MaxMessageSize):
theia full # find ../data/ -type f -size +4M | wc -l
1
theia full #

Messages having no body part:
theia full # find ../data -type f | while read foo ; do if [ $(sed '1,/^$/d' 
${foo} | wc -l) -lt 1 ] ; then echo ${foo} ; fi ; done | wc -l
411
theia full #

Result when training the full index:
Total messages in full index: 37'822
Total messages failing: 11
Failing percentage: 0.03%

I fiddled around yesterday with messages that have no body, and I was wrong in 
my assumption that processing the message as a document would fix the issue. I 
tested with one message from TREC05 that has just 4 Received mail headers, and 
classifying that message failed regardless of what setting I used in 
DataSource. I then found out that adding more mail headers solves the issue. 
Which headers is not important; they could be anything. So the problem is not 
the absence of the body part but the choice of tokenizer and other settings. I 
use OSB, and degenerating the message with 4 Received headers resulted in no 
tokenizable content. The problem, by the way, was not OSB alone but the 
"IgnoreHeader Received" option I have in dspam.conf. That eliminated 
everything from those 4 Received headers, resulting in 0 tokens. And DSPAM 
can't classify if it does not have any tokens for the computation.
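To make the failure mode concrete, here is a minimal sketch in Java (the class and method names are mine; DSPAM itself is written in C and its real tokenizer is far more involved): a simplified header tokenizer that drops every header whose name is in an ignore set, the way "IgnoreHeader Received" does. Fed a message consisting of nothing but Received headers, it produces zero tokens, and with zero tokens there is nothing left to compute on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class IgnoreHeaderDemo {
    // Hypothetical simplified header tokenizer: parses "Name: value" lines
    // and skips every header whose name is in the ignore set, analogous to
    // "IgnoreHeader Received" in dspam.conf.
    static List<String> tokenizeHeaders(String[] headerLines, Set<String> ignored) {
        List<String> tokens = new ArrayList<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;                   // not a header line
            String name = line.substring(0, colon).trim();
            if (ignored.contains(name)) continue;      // header dropped entirely
            for (String word : line.substring(colon + 1).trim().split("\\s+")) {
                if (!word.isEmpty()) tokens.add(name + "*" + word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] onlyReceived = {
            "Received: from mx1.example.com by theia",
            "Received: from mx2.example.com by mx1.example.com",
            "Received: from client.example.org by mx2.example.com",
            "Received: from localhost by client.example.org"
        };
        // With "Received" ignored and no body, nothing survives tokenization.
        System.out.println(tokenizeHeaders(onlyReceived, Set.of("Received")).size()); // prints 0
    }
}
```

Remove "Received" from the ignore set and the same message yields tokens again, which matches what I saw when adding other headers to the test message.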

I am 100% sure that DSPAM 3.8.0 would fail as well with the same 
configuration. In fact, I am sure that DSPAM 3.9.0 would do a better job in 
classifying mails, since the new HTML stripper extracts more tokenizable 
content out of HTML mails. So without testing I would say that DSPAM 3.9.0 
should beat DSPAM 3.8.0. But again: I need to test that.


=================================================
Now a new issue:
=================================================

While processing TREC06 I looked at the included corpus of Chinese messages, 
and I realized one thing: DSPAM is totally bad at that language. The problem 
is not the language itself; DSPAM is not aware of the language and does not 
need to be. But the big problem with those mails is that most of them have a 
huge block of text with no word boundaries. DSPAM attempts to split words in 
order to tokenize them and produce tokens, but since GB2312, Big5, Thai and 
any other language without real word boundaries cannot be split that way, 
what happens is that DSPAM tokenizes many words into one token, and this is 
totally useless. Let me try to illustrate the problem using English text 
while explaining how it would be handled in languages that have no word 
boundaries.

English text: "To bail or not to bail?"

Now in Chinese this text would not be separated; Chinese does not use spaces 
as word separators. It uses characters/symbols instead of space-delimited 
words. So imagine all the spaces removed: "Tobailornottobail?"
DSPAM now tokenizes this as one single word. In reality this would be 6 
words, but DSPAM does not know anything about that. It does not know that the 
language uses symbols and not space-separated words.

We need to fix that! The problem is that there is no really easy fix for it, 
since those symbols/words can be anything from 1 to 4 characters long. It is 
hard, without actually reading the text, to know when to break and when not 
to (the language itself does not break between symbols, so any break would be 
artificial, just for the tokenizer). I read the Unicode specs yesterday, and 
GB2312 has a gazillion characters/symbols. It's huge! And the bad thing: 
DSPAM has no Unicode handling. So we would need to implement that first, then 
use a method to transform GB2312, Big5 or any other encoding for languages 
without word boundaries into Unicode, and then try to break the symbols into 
small pieces of tokenizable content.

An easy way could be to simply break after every 2 bytes. Or we could go 
further and break everything into 1, 2, 3 and 4 byte words, without knowing 
what we break: just fixed breaks at 1, 2, 3 and 4 bytes. Or we could code the 
whole character set from Unicode into DSPAM for those languages and then 
break based on that (GB2312 has around 7,445 symbols; Big5 is even bigger).
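Here is a minimal sketch of that fixed-break idea, in Java (the names are mine, and for real GB2312/Big5 mail you would first have to decode to code points rather than slicing raw bytes or chars): emit every overlapping slice of 1 up to N characters as its own token, so the classifier gets countable units even without knowing the real word boundaries. Using the English illustration above with the spaces removed:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramBreaker {
    // Break a run of text that has no word boundaries into all overlapping
    // n-grams of length 1..maxN. We do not know where the real words are, so
    // we emit every candidate slice and let the classifier's statistics sort
    // out which ones carry signal.
    static List<String> ngrams(String text, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++)
            for (int i = 0; i + n <= text.length(); i++)
                out.add(text.substring(i, i + n));
        return out;
    }

    public static void main(String[] args) {
        // A fragment of the boundary-less illustration from above:
        System.out.println(ngrams("Tobail", 2));
        // prints [T, o, b, a, i, l, To, ob, ba, ai, il]
    }
}
```

The cost is token inflation: a block of k symbols produces roughly k tokens per gram length, which is one reason a smarter, dictionary-aware breaker (like the GB2312 rules I prototyped) would still be preferable.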

btw: We are not the only ones having this issue. Basically every search 
engine has it. I wonder how other anti-spam solutions handle this case?

Yesterday I quickly prototyped a function in Java that breaks a stream of 
symbols into tokenizable words by following the rules outlined by Dr. Herong 
Yang. But that is just for GB2312, and the other languages might have other 
rules. I don't know if I should invest more time into that? I feel like I am 
the wrong person for it. Someone from the DSPAM community who uses those 
languages should help here.

There is so much to do and I can't handle all of it. We need more developers. 
Maybe I should quote Steve Ballmer from Microsoft: 
http://video.google.com/videoplay?docid=8913084255008000794#

btw: I am slim and I do not sweat as much as he does. But he is 15 years 
older than me. Maybe in 15 years I might have that front building as well? :) 
I hope not. I like to shower and know that while doing it my feet get wet. 
Ohhh boy! A bumblebee probably needs half a day to orbit around this guy.


-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel