I have had a similar experience with Eastern European languages. No matter how many times I re-train in TOE mode, messages from the same users always get classified as spam. What's worse, since I cannot re-train, white-listing fails as well.
I am running in a shared group and get an overall accuracy of about 98.6%:

               TP True Positives:           6578
               TN True Negatives:          29488
               FP False Positives:           257
               FN False Negatives:           250
               SC Spam Corpusfed:             12
               NC Nonspam Corpusfed:           1
               TL Training Left:               0
               SHR Spam Hit Rate          96.34%
               HSR Ham Strike Rate:        0.86%
               OCA Overall Accuracy:      98.61%
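
For anyone who wants to check the three percentages against the raw counts, here is a minimal sketch. The formulas are my assumption about how the rates are derived (they are not taken from the DSPAM source), but they do reproduce the figures above. The function name `dspam_rates` is made up for this example.

```python
def dspam_rates(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (spam hit rate, ham strike rate, overall accuracy) in percent.

    Assumed definitions, not taken from the DSPAM source:
      SHR = TP / (TP + FN)       share of spam that was caught
      HSR = FP / (TN + FP)       share of ham that was wrongly flagged
      OCA = (TP + TN) / total    share of all messages classified correctly
    """
    shr = 100.0 * tp / (tp + fn)
    hsr = 100.0 * fp / (tn + fp)
    oca = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return shr, hsr, oca


# Counts copied from the statistics above.
shr, hsr, oca = dspam_rates(tp=6578, tn=29488, fp=257, fn=250)
print(f"SHR {shr:.2f}%  HSR {hsr:.2f}%  OCA {oca:.2f}%")
# prints: SHR 96.34%  HSR 0.86%  OCA 98.61%
```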


I know dspam tokenizes the emails and calculates probabilities, but these problems persist. The strange thing is that the encoding is us-ascii; only the language changes from English. I know the info here does not help much to determine the problem; I just wanted to know
if I was alone with these problems.

-Daniel.


Patrick T. Tsang wrote:
I think most people who have never come across Chinese don't know how Chinese works.
The Chinese I am talking about is encoded in BIG5, GB2312, or (better) UTF-8.
Until MS$ takes over the whole world, I don't think we will give up the GB2312 and BIG5 charsets.

If you look at BIG5 and GB2312, they occupy largely the same byte ranges. Both are 2-byte encodings that reuse the same code positions for different characters.
There is no way for DSPAM to tell GB2312 from BIG5...
I cannot just mark GB2312 email as spam, since BIG5 email will be marked as spam too.
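
To illustrate the overlap, here is a small sketch that encodes a Chinese word as BIG5 and then reads the very same bytes back as GB2312. It only shows that identical bytes mean different characters in the two charsets; it says nothing about how DSPAM itself handles charsets. The helper name `two_readings` is made up for this example.

```python
def two_readings(text: str) -> tuple[bytes, str, str]:
    """Encode `text` as BIG5, then decode the same bytes as BIG5 and as GB2312.

    errors="replace" keeps the demo from raising if a byte pair happens
    not to be assigned in GB2312.
    """
    raw = text.encode("big5")
    as_big5 = raw.decode("big5", errors="replace")
    as_gb2312 = raw.decode("gb2312", errors="replace")
    return raw, as_big5, as_gb2312


# U+4E2D U+6587 is the word for "Chinese (written language)".
raw, as_big5, as_gb2312 = two_readings("\u4e2d\u6587")

# The byte string is the same in both cases; only the interpretation differs.
same_text = as_big5 == as_gb2312  # False: the two charsets disagree
```

A filter that only looks at raw bytes, without knowing the declared charset, would see the same tokens for both readings.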

I have seen too many cases where dspam failed to detect the correct encoding.

BTW, the biggest problem is the re-train process... no one here can explain it...

Good luck
Patrick



----- Original Message -----
From: "Dov Zamir" <[EMAIL PROTECTED]>
To: "Patrick T. Tsang" <[EMAIL PROTECTED]>
Cc: "Kent Tong" <[EMAIL PROTECTED]>; <[email protected]>
Sent: Saturday, January 27, 2007 3:42 PM
Subject: Re: [dspam-users] won't learn?


Quoting Patrick T. Tsang:
Hello Kent,

I have the same problem.
I have already given up on DSPAM. The results are not good, and the maintenance is too difficult to deal with.

No one here can answer my question about re-learning...

I think DSPAM has a good approach to handling spam, but it is not designed for Chinese.

Patrick,

I don't think that is correct. DSPAM tokenizes the email; it has no concept of language. It works just fine with Hebrew in my setup, so why would it not work with Chinese?

Good luck
Patrick



----- Original Message -----
From: "Kent Tong" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, January 27, 2007 11:21 AM
Subject: Re: [dspam-users] won't learn?


Marcin Krol wrote:
1. Try looking up the DSPAM factors in the message headers,
(you can view full message by pressing Ctrl-U in Thunderbird
or F9 in The Bat), the headers may give you some clue?

I just found out even for a spam correctly identified as spam, if
I classify it again, it will say it's innocent. If I delete the
headers generated by dspam (including the "Received by:" headers
it and Cyrus generated), then it will classify it as spam.

However, for a spam that wasn't identified, even after training it,
dspam is still classifying its header-removed version as innocent.
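
For anyone who wants to repeat this test, here is a rough sketch of stripping the DSPAM-added headers from a saved message before feeding it back in. It assumes those headers all start with `X-DSPAM` (check your own messages), and it does not touch the Received headers or any signature placed in the body. The function name `strip_dspam_headers` is made up for this example.

```python
from email import message_from_string

# Assumption: headers added by DSPAM share this prefix (compared case-insensitively).
DSPAM_PREFIX = "x-dspam"


def strip_dspam_headers(raw_message: str) -> str:
    """Return the message text with every header starting with X-DSPAM removed."""
    msg = message_from_string(raw_message)
    # Copy the key list first, because we delete headers while looping.
    for name in list(msg.keys()):
        if name.lower().startswith(DSPAM_PREFIX):
            del msg[name]  # removes all occurrences; a no-op if already gone
    return msg.as_string()


example = (
    "X-DSPAM-Result: Innocent\n"
    "X-DSPAM-Confidence: 0.9\n"
    "Subject: test\n"
    "\n"
    "body text\n"
)
cleaned = strip_dspam_headers(example)
```

The cleaned text can then be passed to dspam again to see whether the result changes.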

2. Have you changed the default spam-probability algorithms
in dspam.conf? You could tweak those and see what changes.

No.

--
Kent Tong
Useful news for CIO's at http://www2.cpttm.org.mo/cyberlab/cio-news


_________________________________________________________________________
This message has been scanned by Kibbutz Beit Kama's Anti Virus software,
and is believed to be clean of any viruses.
_________________________________________________________________________


