I have a similar experience with Eastern European languages. No matter
how many times I re-train in TOE mode, the messages
from the same users always get classified as SPAM. What's worse, since I
cannot re-train, white-listing fails as well.
I am running in a shared group and manage to get an accuracy rate of about 98.6%:
TP True Positives: 6578
TN True Negatives: 29488
FP False Positives: 257
FN False Negatives: 250
SC Spam Corpusfed: 12
NC Nonspam Corpusfed: 1
TL Training Left: 0
SHR Spam Hit Rate 96.34%
HSR Ham Strike Rate: 0.86%
OCA Overall Accuracy: 98.61%
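For reference, the three rates above follow from the four raw counts. A minimal sketch in Python, assuming the usual definitions (spam hit rate = TP/(TP+FN), ham strike rate = FP/(FP+TN), overall accuracy = (TP+TN)/total). The function name is only for illustration and is not part of dspam:

```python
def dspam_rates(tp, tn, fp, fn):
    """Return (SHR, HSR, OCA) as percentages from raw classification counts."""
    shr = 100.0 * tp / (tp + fn)                    # share of spam that was caught
    hsr = 100.0 * fp / (fp + tn)                    # share of ham wrongly flagged
    oca = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # share of all mail classified correctly
    return shr, hsr, oca

# Counts taken from the statistics listed above.
shr, hsr, oca = dspam_rates(tp=6578, tn=29488, fp=257, fn=250)
print(f"SHR {shr:.2f}%  HSR {hsr:.2f}%  OCA {oca:.2f}%")
```

With these counts the sketch reproduces the 96.34%, 0.86% and 98.61% figures shown in the listing.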
I know dspam tokenizes the emails and calculates probabilities, but
these problems persist. The strange thing is that the encoding
is us-ascii; only the language changes from English. I know the info here
does not help much to determine the problem; I just wanted to know
if I was alone with these problems.
-Daniel.
Patrick T. Tsang wrote:
I think most people who have never come across Chinese don't know how
Chinese works.
The Chinese I am talking about is BIG5, GB2312, or UTF-8 (better).
Before MS$ enterprises the whole world, I don't think we will give up
GB2312 and BIG5 charset.
If you look at BIG5 and GB2312, they use largely the same byte ranges;
the two charsets occupy mostly the same positions in the code table.
Both are 2-byte encodings (not ASCII) that share the same code positions
but map them to different characters.
There is no way for Dspam to see which is GB2312 and which is BIG5...
I cannot just train GB2312 email as spam, since the BIG5 email will be
treated as spam too.
I have seen too many cases where dspam failed to detect the correct
encoding.
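To illustrate the overlap described above, here is a small Python sketch. It uses only the standard `gb2312` and `big5` codecs and shows that bytes produced by one encoding can also be read under the other, giving different text. It only demonstrates the byte-level ambiguity; it says nothing about how Dspam itself handles charsets, and the sample string is arbitrary.

```python
text = "中文"                       # two Chinese characters, chosen arbitrarily
raw = text.encode("gb2312")         # two bytes per character

# Decode the very same bytes under both charsets.
as_gb = raw.decode("gb2312")
as_big5 = raw.decode("big5", errors="replace")  # "replace" avoids an exception on unmapped pairs

print(raw.hex(" "))                 # the raw bytes, identical for both readings
print(as_gb, as_big5, as_gb == as_big5)
```

Without a reliable charset label on the message, nothing in the bytes themselves tells the two readings apart.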
BTW, the biggest problem is the re-train process... no one here can tell...
Good luck
Patrick
----- Original Message ----- From: "Dov Zamir" <[EMAIL PROTECTED]>
To: "Patrick T. Tsang" <[EMAIL PROTECTED]>
Cc: "Kent Tong" <[EMAIL PROTECTED]>;
<[email protected]>
Sent: Saturday, January 27, 2007 3:42 PM
Subject: Re: [dspam-users] won't learn?
Quoting Patrick T. Tsang:
Hello Kent,
I have the same problem.
And I have given up on Dspam already. The results are not good, and the
maintenance is too difficult to deal with.
No one here can answer my question about re-learning...
I think Dspam has a good idea for handling spam, but it is not
designed for Chinese.
Patrick,
I don't think that is correct. DSPAM tokenizes the email; it has no
concept of language. It works just fine with Hebrew in my setup, so
why would it not work with Chinese?
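As a rough illustration of what "tokenizes with no concept of language" can mean, here is a toy delimiter-based tokenizer in Python. The delimiter set and the `tokenize` function are hypothetical and are not DSPAM's actual tokenizer; the sample strings are made up.

```python
import re

# Hypothetical delimiter set for illustration only; DSPAM's real tokenizer differs.
DELIMITERS = re.compile(r"[\s.,;:!?\"()<>\[\]]+")

def tokenize(text):
    """Split text on delimiters only; no language detection is involved."""
    return [tok for tok in DELIMITERS.split(text) if tok]

english = tokenize("Buy cheap pills now!")
hebrew = tokenize("שלום עולם, מה נשמע?")      # words separated by spaces
chinese = tokenize("你好世界今天天气很好")       # no spaces between words

print(english)
print(len(hebrew), len(chinese))
```

In this toy version, space-separated scripts such as Hebrew split into word tokens just like English, while unsegmented Chinese text comes out as one long token. Whether DSPAM's own tokenizer behaves this way is not established in this thread.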
Good luck
Patrick
----- Original Message ----- From: "Kent Tong" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, January 27, 2007 11:21 AM
Subject: Re: [dspam-users] won't learn?
Marcin Krol wrote:
1. Try looking up the DSPAM factors in the message headers
(you can view the full message by pressing Ctrl-U in Thunderbird
or F9 in The Bat); the headers may give you some clue.
I just found out that even for a spam correctly identified as spam, if
I classify it again, it says it's innocent. If I delete the
headers generated by dspam (including the "Received by:" headers that
it and Cyrus generated), it then classifies it as spam.
However, for a spam that wasn't identified, even after training it,
dspam still classifies its header-removed version as innocent.
2. Have you changed the default spam-probability algorithms
in dspam.conf? You could tweak those and see what changes.
No.
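The header-stripping experiment described under point 1 can be sketched in Python with the standard `email` module. This assumes the headers dspam adds all start with `X-DSPAM` (an assumption about the default configuration); it does not touch the `Received` headers or any signature tag in the body, and the sample message is made up.

```python
from email import message_from_string

def strip_dspam_headers(raw_message):
    """Return the message text with every X-DSPAM* header removed."""
    msg = message_from_string(raw_message)
    for name in set(msg.keys()):
        if name.lower().startswith("x-dspam"):
            del msg[name]          # removes all occurrences of that header
    return msg.as_string()

# Made-up sample message for illustration.
sample = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "X-DSPAM-Result: Spam\n"
    "X-DSPAM-Confidence: 0.9\n"
    "\n"
    "body text\n"
)
cleaned = strip_dspam_headers(sample)
print(cleaned)
```

The cleaned text can then be fed back to the classifier to compare results with and without the earlier dspam headers.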
--
Kent Tong
Useful news for CIO's at http://www2.cpttm.org.mo/cyberlab/cio-news
_________________________________________________________________________
This message has been scanned by Kibbutz Beit Kama's Anti Virus
software,
and is believed to be clean of any viruses.
_________________________________________________________________________