I measured how the Bayes database improved. I chose 101 ham messages
that scored BAYES_99 with the old Bayes database (i.e. they were false
positives). I tested these with the new Bayes engine and got the
following results. Test 1: run with the new database. Test 2:
sa-learned, then tested again with the new database. Test 3: re-run
with the old database. The new database has learned 6500 spams and
12200 hams; the old database has learned 14300 spams and 250000 hams.
           Test 1   Test 2   Test 3
BAYES_00       12       92        3
BAYES_05        3        2        0
BAYES_20        7        3        3
BAYES_40        5        3        4
BAYES_50       61        1       75
BAYES_60       10        0        3
BAYES_80        2        0        2
BAYES_95        1        0       11
John Myers wrote:
> My experience shows that speed only becomes an issue when one ends up
> using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
> I believe correctness is more important. I would have to see a
> significant measured decrease in speed before considering sacrificing
> correctness for speed.
I made a UTF-8-aware tokenize_line() and measured the processing time
of tokenize(). The Unicode-aware version took about 240 ms; the
byte-oriented version took only 9 ms.
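For reference, the comparison was made along these lines. This is a
minimal sketch with placeholder input, not the actual harness; the
sample text, iteration count, and sub names are illustrative:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my $line = "sample body text with some words " x 100;

timethese(1000, {
    bytes_oriented => sub {
        local $_ = $line;
        tr/-A-Za-z0-9 / /cs;      # plain byte-oriented tr///
    },
    utf8_aware => sub {
        local $_ = $line;
        utf8::decode($_);         # mark the string as UTF-8 characters
        tr/-A-Za-z0-9 / /cs;
        utf8::encode($_);         # back to bytes
    },
});
```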
I do not have appropriate test data in which \xa0 is inserted by
HTML::Parser or the like. My test data gave the same result with both
tokenize_line functions.
> The fact that the Bayes code confuses A0 bytes in UTF-8 encoded
> characters with the U+00A0 character is one example of an issue that
> would be solved were the "use bytes" pragma removed. To be correct, the
> Bayes database should be storing all tokens in UTF-8, so they match
> regardless of how they are encoded.
Yes, setting and clearing the UTF-8 flag on a string is necessary when
moving back and forth between UTF-8-aware and byte-oriented routines.
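The round trip looks like this (a minimal sketch; the sample string is
illustrative). utf8::decode() reinterprets the bytes as characters and
sets the flag, utf8::encode() serializes back to bytes and clears it:

```perl
use strict;
use warnings;

my $s = "\xe3\x81\x82";              # UTF-8 bytes for U+3042 (hiragana "a")
printf "as bytes: length = %d\n", length $s;   # 3
utf8::decode($s);                    # bytes -> characters; sets the UTF-8 flag
printf "as chars: length = %d\n", length $s;   # 1
utf8::encode($s);                    # characters -> bytes; clears the flag
printf "as bytes: length = %d\n", length $s;   # 3
```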
However, in utf8 mode we can take language-specific aspects into
account. I added code to drop single-character hiragana/katakana/symbol
tokens.
> I'm not yet convinced that tokenization belongs inside
> get_rendered_body_text_array() and
> get_visible_rendered_body_text_array(). I suspect the content preview,
> which uses get_rendered_body_text_array(), would look strange were it to
> be tokenized. I am using get_visible_rendered_body_text_array() for
> something which I'm not yet convinced needs tokenization. I think this
> area needs some field experience.
I removed splitter() and tested with an e-mail containing a word split
by a line break (i.e. the word contains "\n"). The body test could not
find this word. I think language-specific tokenization is necessary not
only for Bayes but for the other tests as well.
A friend of mine reminded me that there is a second "normalization"
issue in Japanese. Japanese charsets contain two-byte versions of the
alphanumeric and some symbol characters. We call these "zenkaku", and
the corresponding 7-bit forms "hankaku". Of course, the zenkaku version
of a word does not match the hankaku version.
The following code is part of the "zenkaku-to-hankaku" normalization.
$text =~ tr/\x{ff10}-\x{ff19}/0-9/;
$text =~ tr/\x{ff21}-\x{ff3a}/A-Z/;
$text =~ tr/\x{ff41}-\x{ff5a}/a-z/;
$text =~ tr/\x{2018}/`/;
$text =~ tr/\x{2019}/'/;
...........
I think this normalization should be done before the header and body tests.
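For example, the fragments above could be collected into one helper
that runs before the header and body tests. normalize_zenkaku is a
hypothetical name, and the translation table is incomplete (the "..."
stands for further mappings):

```perl
# Hypothetical helper; expects a character (UTF-8 flagged) string.
sub normalize_zenkaku {
    my ($text) = @_;
    $text =~ tr/\x{ff10}-\x{ff19}/0-9/;   # full-width digits   -> ASCII digits
    $text =~ tr/\x{ff21}-\x{ff3a}/A-Z/;   # full-width capitals -> ASCII capitals
    $text =~ tr/\x{ff41}-\x{ff5a}/a-z/;   # full-width letters  -> ASCII letters
    $text =~ tr/\x{2018}/`/;              # left single quote
    $text =~ tr/\x{2019}/'/;              # right single quote
    # ... further symbol mappings
    return $text;
}
```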
============== patch to make tokenize_line UTF-8 aware =============
--- Bayes.pm.bytes 2006-01-14 20:50:44.000000000 +0900
+++ Bayes.pm 2006-01-14 20:51:02.000000000 +0900
@@ -342,10 +342,13 @@
my @rettokens = ();
+ no bytes;
+ utf8::decode($_);
+
# include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
# and ISO-8859-15 alphas. Do not split on @'s; better results keeping it.
# Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
- tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\200-\377 / /cs;
+ tr/#%&()+\/:;<=>?\[\\]^`{|}~/ /s;
# DO split on "..." or "--" or "---"; common formatting error resulting in
# hapaxes. Keep the separator itself as a token, though, as long ones can
@@ -379,19 +388,24 @@
# but extend the stop-list. These are squarely in the gray
# area, and it just slows us down to record them.
- next if $len < 3 ||
- ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
- m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
- t(?:his|he|ime|hrough|hat)|
- w(?:hy|here|ork|orld|ith|ithout|eb)|
- f(?:rom|or|ew)| e(?:ach|ven|mail)|
- o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
- s(?:uch|ame)| l(?:ook|ike|ong)|
- y(?:ou|our|ou're)|
- The|has|have|into|using|http|see|It's|it's|
- number|just|both|come|years|right|know|already|
- people|place|first|because|
- And|give|year|information|can)$/x);
+ if ( $token =~ /^[\x00-\x7f]+$/ ) {
+ next if $len < 3 ||
+ ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
+ m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
+ t(?:his|he|ime|hrough|hat)|
+ w(?:hy|here|ork|orld|ith|ithout|eb)|
+ f(?:rom|or|ew)| e(?:ach|ven|mail)|
+ o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
+ s(?:uch|ame)| l(?:ook|ike|ong)|
+ y(?:ou|our|ou're)|
+ The|has|have|into|using|http|see|It's|it's|
+ number|just|both|come|years|right|know|already|
+ people|place|first|because|
+ And|give|year|information|can)$/x);
+ }
+ else {
+ next if $len < 2 && $token =~ /^[\p{InHiragana}\p{InKatakana}\x{3000}-\x{303f}]+$/;
+ }
# are we in the body? If so, apply some body-specific breakouts
if ($region == 1 || $region == 2) {
@@ -456,9 +470,11 @@
}
}
+ utf8::encode($token);
push (@rettokens, $tokprefix.$token);
}
+ use bytes;
return @rettokens;
}
--
----------------------------------------------------------------------
Motoharu Kubo (久保 元治)        3ware Co., Ltd.
[EMAIL PROTECTED]                3-39-8 Nishi-Narashino, Funabashi-shi,
                                 Chiba 274-0815, Japan
URL: http://www.3ware.co.jp/
Phone: 047-496-3341   Fax: 047-496-3370
Mobile: 090-6171-5545 / 090-8513-0246
* All mail from our company is checked by the Z-Linux mail filter *