Hello all, I have done the same test for TREC06, with the following result:
Messages over 4MB (my current MaxMessageSize):

theia full # find ../data/ -type f -size +4M | wc -l
1

Messages having no body part:

theia full # find ../data -type f | while read foo ; do if [ $(sed '1,/^$/d' ${foo} | wc -l) -lt 1 ] ; then echo ${foo} ; fi ; done | wc -l
411

Result when training the full index:

Total messages in full index: 37'822
Total messages failing: 11
Failing percentage: 0.03%

I fiddled around yesterday with messages that have no body, and I was wrong in my assumption that processing the message as a document would fix the issue. I tested with one message from TREC05 that has just 4 Received mail headers, and classifying that message failed regardless of what setting I had in DataSource. I then found out that adding more mail headers solves the issue. Which headers is not important; they could be anything. So the problem is not the absence of the body part but the choice of tokenizer and other settings. I use OSB, and degenerating the message with 4 Received headers resulted in no tokenizable content. The problem, btw, was not OSB alone but the "IgnoreHeader Received" option I have in dspam.conf. That eliminated everything from those 4 Received headers, resulting in 0 tokens. And DSPAM can't classify if it does not have any tokens to do the computation.

I am 100% sure that DSPAM 3.8.0 would fail as well with the same configuration. In fact, I am sure that DSPAM 3.9.0 would do a better job classifying mails, since the new HTML stripper gets more tokenizable content out of HTML mails. So without testing I would say that DSPAM 3.9.0 should beat DSPAM 3.8.0. But again: I need to test that.

=================================================
Now a new issue:
=================================================

While processing TREC06 I looked at the included corpus of Chinese messages, and I realized one thing: DSPAM is totally bad at that language.
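Before going into that, a minimal sketch of the zero-token failure described above. This is NOT DSPAM code; the header handling and the word pattern are simplified assumptions, just to show why "IgnoreHeader Received" on a message that has nothing but Received headers leaves the classifier with no tokens at all:

```python
# Sketch (not DSPAM code): why "IgnoreHeader Received" plus a body-less
# message leaves zero tokens. Header parsing and the \w+ word pattern
# are illustrative simplifications.
import re

def tokenize(message: str, ignored_headers=frozenset({"received"})) -> list[str]:
    """Split a raw message into header and body tokens, dropping every
    header whose name is in ignored_headers (case-insensitive)."""
    head, _, body = message.partition("\n\n")
    tokens = []
    for line in head.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() in ignored_headers:
            continue                      # the whole header line is discarded
        tokens += re.findall(r"\w+", value)
    tokens += re.findall(r"\w+", body)
    return tokens

# A message consisting of nothing but Received headers and no body:
msg = ("Received: from a.example by b.example; Mon, 1 Jun 2009\n"
       "Received: from c.example by a.example; Mon, 1 Jun 2009\n"
       "\n")
print(tokenize(msg))   # -> []  nothing left to compute on
```

Add any other header (a Subject, an X-Foo, anything) and tokens appear again, which matches what I saw in the test.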
The problem is not the language itself. DSPAM is not aware of the language and does not need to be. The big problem with those mails is that most of them have a huge block of text with no word boundaries. DSPAM attempts to split text into words in order to produce tokens, but since GB2312, Big5, Thai and other languages without real word boundaries give it nothing to split on, DSPAM can't split. What happens is that DSPAM tokenizes many words into one token, and that is totally useless.

Let me try to illustrate the problem using English text, but explaining how it would be handled in a language that has no word boundaries. English text: "To bail or not to bail?" In Chinese this text would not be separated; Chinese does not use space as a word separator. They use characters/symbols instead of words. So imagine all the spaces removed: "Tobailornottobail?" DSPAM now tokenizes this as a single word. In reality this would be 6 words, but DSPAM does not know anything about that. It does not know that the language uses symbols and not words.

We need to fix that! The problem is that there is no really easy fix, since those symbols/words could be anything from 1 to 4 characters long. Without actually reading the text, it's hard to know when to break and when not to (the language itself does not break between symbols anyway; any break would be artificial, just for the tokenizer). I read the Unicode specs yesterday, and GB2312 has a gazillion characters/symbols. It's huge! And the bad thing: DSPAM has no Unicode handling. So we would need to implement that first, then use a method to transform GB2312, Big5 or any other encoding without word boundaries into Unicode, and then try to break the symbols into small pieces of tokenizable content.

An easy way could be to say that we break after 2 bytes. Or we could go on and break everything into 1, 2, 3 and 4 byte words, without knowing what we break.
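The blind "break into 1, 2, 3 and 4 byte words" idea can be sketched in a few lines. This is illustrative code under stated assumptions, not a proposal for the final implementation: it uses the English example from above with spaces removed, whereas the real input would be the raw GB2312/Big5 bytes, and it simply emits every overlapping chunk and lets the token statistics decide which chunks are meaningful:

```python
# Sketch of the "blind n-byte break": without knowing where the real
# word boundaries are, emit every overlapping 1- to 4-byte chunk.
def blind_ngrams(data: bytes, sizes=(1, 2, 3, 4)) -> list[bytes]:
    """Return all overlapping chunks of the given sizes."""
    return [data[i:i + n]
            for n in sizes
            for i in range(len(data) - n + 1)]

text = "Tobailornottobail".encode("ascii")
chunks = blind_ngrams(text, sizes=(2,))   # 2-byte chunks only
print(chunks[:4])   # [b'To', b'ob', b'ba', b'ai']
```

For GB2312, where each symbol is 2 bytes, the 2-byte chunks would at least line up with single symbols; the 1- and 3-byte chunks would mostly be noise that training should weight down.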
Just a fixed break at 1, 2, 3 and 4 bytes. Or we could code the whole character set from Unicode into DSPAM for those languages and then break based on that (GB2312 has around 7'445 symbols; Big5 is even bigger).

btw: We are not the only ones having this issue. Basically every search engine has it. I wonder how other anti-spam solutions handle that case?

Yesterday I quickly prototyped a function in Java that breaks a stream of symbols into tokenizable words, following the rules outlined by Dr. Herong Yang. But that's just for GB2312, and the other languages might have other rules. I don't know if I should invest more time into that? I feel like the wrong person for it. Someone from the DSPAM community who uses those languages should help here. There is so much to do, and I can't handle all of it. We need more developers. Maybe I should quote Steve Ballmer from Microsoft: http://video.google.com/videoplay?docid=8913084255008000794#

btw: I am slim and I am not sweating as much as he does. But he is 15 years older than me. Maybe in 15 years I might have that front building as well? :) I hope not. I like to shower and know that while doing it my feet get wet. Ohhh boy! A bumblebee probably needs half a day to orbit around this guy.

--
Kind Regards from Switzerland,
Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel