I measured how the Bayes database improved. I chose 101 ham messages
that scored BAYES_99 with the old Bayes database (i.e. they were false
positives). I tested these with the new Bayes engine and got the
following results. Test 1: run with the new database. Test 2:
sa-learned, then tested again with the new database. Test 3: re-run
with the old database. The new database has learned 6500 spams and
12200 hams; the old database has learned 14300 spams and 250000 hams.
           Test 1   Test 2   Test 3
BAYES_00       12       92        3
BAYES_05        3        2        0
BAYES_20        7        3        3
BAYES_40        5        3        4
BAYES_50       61        1       75
BAYES_60       10        0        3
BAYES_80        2        0        2
BAYES_95        1        0       11
John Myers wrote:
> My experience shows that speed only becomes an issue when one ends up
> using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
> I believe correctness is more important. I would have to see a
> significant measured decrease in speed before considering sacrificing
> correctness for speed.
I made a UTF-8-aware tokenize_line() and measured the processing time
of tokenize(). The Unicode-aware version took about 240 ms; the
byte-oriented version took only 9 ms.
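For reference, the comparison was made along these lines. This is a
minimal sketch with placeholder input, not the actual harness; the
sample text, iteration count, and sub names are illustrative:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my $line = "sample body text with some words " x 100;

timethese(1000, {
    bytes_oriented => sub {
        local $_ = $line;
        tr/-A-Za-z0-9 / /cs;      # plain byte-oriented tr///
    },
    utf8_aware => sub {
        local $_ = $line;
        utf8::decode($_);         # mark the string as UTF-8 characters
        tr/-A-Za-z0-9 / /cs;
        utf8::encode($_);         # back to bytes
    },
});
```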
I do not have appropriate test data in which \xa0 is inserted by
HTML::Parser or the like. My test data gave the same result with both
tokenize_line functions.
> The fact that the Bayes code confuses A0 bytes in UTF-8 encoded
> characters with the U+00A0 character is one example of an issue that
> would be solved were the "use bytes" pragma removed. To be correct, the
> Bayes database should be storing all tokens in UTF-8, so they match
> regardless of how they are encoded.
Yes, setting and clearing the UTF-8 flag on a string is necessary when
moving back and forth between UTF-8-aware and byte-oriented routines.
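The round trip looks like this (a minimal sketch; the sample string is
illustrative). utf8::decode() reinterprets the bytes as characters and
sets the flag, utf8::encode() serializes back to bytes and clears it:

```perl
use strict;
use warnings;

my $s = "\xe3\x81\x82";              # UTF-8 bytes for U+3042 (hiragana "a")
printf "as bytes: length = %d\n", length $s;   # 3
utf8::decode($s);                    # bytes -> characters; sets the UTF-8 flag
printf "as chars: length = %d\n", length $s;   # 1
utf8::encode($s);                    # characters -> bytes; clears the flag
printf "as bytes: length = %d\n", length $s;   # 3
```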
However, in utf8 mode we can take language-specific aspects into
account. I added code to drop single-character hiragana/katakana/symbol
tokens.
> I'm not yet convinced that tokenization belongs inside
> get_rendered_body_text_array() and
> get_visible_rendered_body_text_array(). I suspect the content preview,
> which uses get_rendered_body_text_array(), would look strange were it to
> be tokenized. I am using get_visible_rendered_body_text_array() for
> something which I'm not yet convinced needs tokenization. I think this
> area needs some field experience.
I removed splitter() and tested with an e-mail containing a word split
by a line break (i.e. the word contains "\n"). The body test could not
find this word. I think language-specific tokenization is necessary not
only for Bayes but for the other tests as well.
A friend of mine reminded me that there is a second "normalization"
issue in Japanese. Japanese charsets contain two-byte versions of the
alphanumeric and some symbol characters. We call these "zenkaku", and
the corresponding 7-bit forms "hankaku". Of course, the zenkaku version
of a word does not match the hankaku version.
The following code is part of the "zenkaku-to-hankaku" normalization.
$text =~ tr/\x{ff10}-\x{ff19}/0-9/;
$text =~ tr/\x{ff21}-\x{ff3a}/A-Z/;
$text =~ tr/\x{ff41}-\x{ff5a}/a-z/;
$text =~ tr/\x{2018}/`/;
$text =~ tr/\x{2019}/'/;
...........
I think this normalization should be done before the header and body tests.
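For example, the fragments above could be collected into one helper
that runs before the header and body tests. normalize_zenkaku is a
hypothetical name, and the translation table is incomplete (the "..."
stands for further mappings):

```perl
# Hypothetical helper; expects a character (UTF-8 flagged) string.
sub normalize_zenkaku {
    my ($text) = @_;
    $text =~ tr/\x{ff10}-\x{ff19}/0-9/;   # full-width digits   -> ASCII digits
    $text =~ tr/\x{ff21}-\x{ff3a}/A-Z/;   # full-width capitals -> ASCII capitals
    $text =~ tr/\x{ff41}-\x{ff5a}/a-z/;   # full-width letters  -> ASCII letters
    $text =~ tr/\x{2018}/`/;              # left single quote
    $text =~ tr/\x{2019}/'/;              # right single quote
    # ... further symbol mappings
    return $text;
}
```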
============== patch to make tokenize_line UTF-8 aware =============
--- Bayes.pm.bytes 2006-01-14 20:50:44.000000000 +0900
+++ Bayes.pm 2006-01-14 20:51:02.000000000 +0900
@@ -342,10 +342,13 @@
my @rettokens = ();
+ no bytes;
+ utf8::decode($_);
+
# include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
# and ISO-8859-15 alphas. Do not split on @'s; better results keeping it.
# Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
- tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\200-\377 / /cs;
+ tr/#%&()+\/:;<=>?\[\\]^`{|}~/ /s;
# DO split on "..." or "--" or "---"; common formatting error resulting in
# hapaxes. Keep the separator itself as a token, though, as long ones can
@@ -379,19 +388,24 @@
# but extend the stop-list. These are squarely in the gray
# area, and it just slows us down to record them.
- next if $len < 3 ||
- ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
- m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
- t(?:his|he|ime|hrough|hat)|
- w(?:hy|here|ork|orld|ith|ithout|eb)|
- f(?:rom|or|ew)| e(?:ach|ven|mail)|
- o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
- s(?:uch|ame)| l(?:ook|ike|ong)|
- y(?:ou|our|ou're)|
- The|has|have|into|using|http|see|It's|it's|
- number|just|both|come|years|right|know|already|
- people|place|first|because|
- And|give|year|information|can)$/x);
+ if ( $token =~ /^[\x00-\x7f]+$/ ) {
+ next if $len < 3 ||
+ ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
+ m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
+ t(?:his|he|ime|hrough|hat)|
+ w(?:hy|here|ork|orld|ith|ithout|eb)|
+ f(?:rom|or|ew)| e(?:ach|ven|mail)|
+ o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
+ s(?:uch|ame)| l(?:ook|ike|ong)|
+ y(?:ou|our|ou're)|
+ The|has|have|into|using|http|see|It's|it's|
+ number|just|both|come|years|right|know|already|
+ people|place|first|because|
+ And|give|year|information|can)$/x);
+ }
+ else {
+ next if $len < 2 && $token =~ /^[\p{InHiragana}\p{InKatakana}\x{3000}-\x{303f}]+$/;
+ }
# are we in the body? If so, apply some body-specific breakouts
if ($region == 1 || $region == 2) {
@@ -456,9 +470,11 @@
}
}
+ utf8::encode($token);
push (@rettokens, $tokprefix.$token);
}
+ use bytes;
return @rettokens;
}
--
----------------------------------------------------------------------
Motoharu Kubo (久保 元治)        3ware Co., Ltd.
[EMAIL PROTECTED]                3-39-8 Nishi-Narashino, Funabashi-shi,
                                 Chiba 274-0815, Japan
URL: http://www.3ware.co.jp/
Phone: 047-496-3341   Fax: 047-496-3370
Mobile: 090-6171-5545 / 090-8513-0246
* All mail from our company is checked by the Z-Linux mail filter *