"John Myers" writes:
>>I must say I was quite pleasantly surprised to find my change tested so
>>quickly during a weekend.
That is because your patch and proposal are exactly what I wanted and had
been searching for.
"Justin Mason" writes:
> We could make it dependent on TextCat's language identification... if
> language is "ja", then apply the Kakasi tokenizer, if available.
This is an excellent idea.
As you and John said, I do not think Kakasi is the best choice. MeCab has
a Perl interface and can handle UTF-8. Its license is tentatively LGPL.
The program does not include a dictionary, so users have to download one
separately, but that is not a problem.
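For illustration, here is a minimal sketch (not part of my patch) of what the
language-conditional tokenization could look like with MeCab's Perl binding.
It assumes the SWIG binding and a UTF-8 dictionary are installed; $language
stands in for whatever TextCat reports.

    use strict;
    use warnings;
    use MeCab;

    # "-Owakati" makes MeCab print the sentence as space-separated words,
    # which is the form the Bayes tokenizer expects.
    sub tokenize_ja {
        my ($utf8_octets) = @_;
        my $tagger = MeCab::Tagger->new('-Owakati');
        my $wakati = $tagger->parse($utf8_octets);
        $wakati =~ s/\s+\z//;          # drop MeCab's trailing newline
        return $wakati;
    }

    my $language = 'ja';                   # placeholder for TextCat's guess
    my $line     = "今日は良い天気です。"; # raw UTF-8 octets
    $line = tokenize_ja($line) if $language eq 'ja';
    print "$line\n";                       # e.g. "今日 は 良い 天気 です 。"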
>>I believe tokenization should be done in Bayes, not in Message::Node. I
>>believe tests should be run against the non-tokenized form.
>
>
> +1 agreed.
Since every byte of a multi-byte UTF-8 character has its high bit set, the
mismatch problem I experienced with iso-2022-jp does not occur here, so I
agree with you as well.
However, there is another issue that I have not written about so far. In
Japanese and some other Asian languages a word can be split across lines
without hyphenation. Joining lines with a space causes problems, while not
joining them can leave an important keyword undetected because of the line
break. I am considering this issue right now.
The most time-consuming but accurate approach would be to tokenize in
do_body_test when the language is "ja" and the content type is "text/plain".
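As a purely hypothetical sketch of that approach (the hook point, the
language field, and tokenize_ja() are stand-ins, not existing SpamAssassin
code):

    # Hypothetical: segment Japanese plain-text parts into words just
    # before the body rules run.
    if (($self->{detected_lang} || '') eq 'ja'
        && $part->{'type'} eq 'text/plain') {
        @lines = map { tokenize_ja($_) } @lines;
    }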
>>>(2) Raw text body is passed to Bayes tokenizer. This causes some
>>> difficulties.
>>
>>My reading of the Bayes code suggests the "visible rendered" form of the
>>body is what is passed to the Bayes tokenizer. But then I don't use
>>Bayes so haven't seen what really happens.
>
>
> Yes, that is the intent (and what happens with english text, at least).
I checked the code and found that Bayes receives normalized header text
but non-normalized body text.
If Bayes should receive (and can handle) normalized body text,
get_visible_rendered_body_text_array() should be modified.
In this function, the content of a text/plain part is obtained by calling
$p->decode(), which returns non-normalized text. Changing this to
"$p->rendered(); $text .= $p->{rendered};" seems to work fine, and Bayes
then receives normalized text.
In addition, \xa0 is treated as whitespace, but a UTF-8 sequence can contain
this byte as its second or third byte. tokenize_line also cuts out
\200-\240. I changed these as well, and Bayes now seems to receive
normalized text.
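A quick illustration of the \xa0 problem (the three bytes below are just the
UTF-8 encoding of one CJK character; the substitution is the one HTML.pm and
Message.pm apply today):

    use strict;
    use warnings;

    # \xa0 can be a continuation byte inside a UTF-8 character, so squashing
    # it as whitespace corrupts the character.
    my $word = "\xE4\xBD\xA0";                  # one CJK character in UTF-8
    (my $mangled = $word) =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
    printf "before: %vX\n", $word;              # E4.BD.A0
    printf "after:  %vX\n", $mangled;           # E4.BD.20 -- no longer valid UTF-8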
I made a patch. Text::Kakasi is still used, so this patch is also
experimental. I will test it for a while.
Any help, suggestions, objections, and warnings are greatly appreciated.
======================= patch begins ===============================
diff -uNr SpamAssassin.orig/Bayes.pm SpamAssassin/Bayes.pm
--- SpamAssassin.orig/Bayes.pm 2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/Bayes.pm 2006-01-10 22:40:14.031120448 +0900
@@ -345,7 +345,7 @@
# include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
# and ISO-8859-15 alphas. Do not split on @'s; better results keeping it.
# Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
- tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\241-\377 / /cs;
+ tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\200-\377 / /cs;
# DO split on "..." or "--" or "---"; common formatting error resulting in
# hapaxes. Keep the separator itself as a token, though, as long ones can
diff -uNr SpamAssassin.orig/HTML.pm SpamAssassin/HTML.pm
--- SpamAssassin.orig/HTML.pm 2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/HTML.pm 2006-01-10 22:39:01.662418537 +0900
@@ -742,7 +742,12 @@
}
}
else {
- $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ s/[ \t\n\r\f\x0b]+/ /g;
+ }
+ else {
+ $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+ }
# trim leading whitespace if previous element was whitespace
if (@{ $self->{text} } &&
defined $self->{text_whitespace} &&
diff -uNr SpamAssassin.orig/Message/Node.pm SpamAssassin/Message/Node.pm
--- SpamAssassin.orig/Message/Node.pm 2005-08-12 09:38:46.000000000 +0900
+++ SpamAssassin/Message/Node.pm 2006-01-10 22:44:27.254093218 +0900
@@ -42,6 +42,8 @@
use Mail::SpamAssassin::HTML;
use Mail::SpamAssassin::Logger;
+our $normalize_supported = ( $] > 5.008004 && eval 'require Encode::Detect::Detector' && eval 'require Encode' );
+
=item new()
Generates an empty Node object and returns it. Typically only called
@@ -342,6 +344,33 @@
return 0;
}
+sub _normalize {
+ my ($data, $charset) = @_;
+ return $data unless $normalize_supported;
+ my $detected = Encode::Detect::Detector::detect($data);
+ dbg("Detected charset ".($detected || 'none'));
+
+ my $converter;
+
+ if ($charset && ($detected || 'none') !~ /^(?:UTF|EUC|ISO-2022|Shift_JIS|Big5|GB)/i) {
+ dbg("Using labeled charset $charset");
+ $converter = Encode::find_encoding($charset);
+ }
+
+ $converter = Encode::find_encoding($detected) unless $converter || !defined($detected);
+
+ return $data unless $converter;
+
+ dbg("Converting...");
+
+ use Text::Kakasi;
+ my $res = Encode::encode("euc-jp",$converter->decode($data, 0)); # EUC-JP octets for Kakasi
+ my $rc = Text::Kakasi::getopt_argv('kakasi','-ieuc','-w'); # -ieuc: EUC input, -w: split into words
+ my $str = Text::Kakasi::do_kakasi($res);
+ my $utf8= Encode::decode("euc-jp",$str); # back to Unicode characters
+ return $utf8;
+}
+
=item rendered()
render_text() takes the given text/* type MIME part, and attempts to
@@ -359,7 +388,7 @@
return(undef,undef) unless ( $self->{'type'} =~ /^text\b/i );
if (!exists $self->{rendered}) {
- my $text = $self->decode();
+ my $text = _normalize($self->decode(), $self->{charset});
my $raw = length($text);
# render text/html always, or any other text|text/plain part as text/html
@@ -478,7 +507,7 @@
if ( $cte eq 'B' ) {
# base 64 encoded
- return Mail::SpamAssassin::Util::base64_decode($data);
+ $data = Mail::SpamAssassin::Util::base64_decode($data);
}
elsif ( $cte eq 'Q' ) {
# quoted printable
@@ -486,12 +515,13 @@
# the RFC states that in the encoded text, "_" is equal to "=20"
$data =~ s/_/=20/g;
- return Mail::SpamAssassin::Util::qp_decode($data);
+ $data = Mail::SpamAssassin::Util::qp_decode($data);
}
else {
# not possible since the input has already been limited to 'B' and 'Q'
die "message: unknown encoding type '$cte' in RFC2047 header";
}
+ return _normalize($data, $encoding);
}
# Decode base64 and quoted-printable in headers according to RFC2047.
@@ -505,15 +535,15 @@
$header =~ s/\n[ \t]+/\n /g;
$header =~ s/\r?\n//g;
- return $header unless $header =~ /=\?/;
-
# multiple encoded sections must ignore the interim whitespace.
# to avoid possible FPs with (\s+(?==\?))?, look for the whole RE
# separated by whitespace.
1 while ($header =~
s/(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)\s+(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)/$1$2/g);
- $header =~
- s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge;
+ unless ($header =~
+ s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge) {
+ $header = _normalize($header);
+ }
return $header;
}
diff -uNr SpamAssassin.orig/Message.pm SpamAssassin/Message.pm
--- SpamAssassin.orig/Message.pm 2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Message.pm 2006-01-10 22:42:22.388213543 +0900
@@ -760,6 +760,7 @@
# 0: content-type, 1: boundary, 2: charset, 3: filename
my @ct = Mail::SpamAssassin::Util::parse_content_type($part_msg->header('content-type'));
$part_msg->{'type'} = $ct[0];
+ $part_msg->{'charset'} = $ct[2];
# multipart sections are required to have a boundary set ... If this
# one doesn't, assume it's malformed and revert to text/plain
@@ -871,7 +872,12 @@
# whitespace handling (warning: small changes have large effects!)
$text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
- $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
+ }
+ else {
+ $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ }
$text =~ tr/\f/\n/; # form feeds => newline
# warn "message: $text";
@@ -925,13 +931,19 @@
}
}
else {
- $text .= $p->decode();
+ $p->rendered();
+ $text .= $p->{rendered};
}
}
# whitespace handling (warning: small changes have large effects!)
$text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
- $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
+ }
+ else {
+ $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ }
$text =~ tr/\f/\n/; # form feeds => newline
my @textary = split_into_array_of_short_lines ($text);
@@ -982,7 +994,12 @@
# whitespace handling (warning: small changes have large effects!)
$text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
- $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+ $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
+ }
+ else {
+ $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
+ }
$text =~ tr/\f/\n/; # form feeds => newline
my @textary = split_into_array_of_short_lines ($text);
diff -uNr SpamAssassin.orig/Util/DependencyInfo.pm SpamAssassin/Util/DependencyInfo.pm
--- SpamAssassin.orig/Util/DependencyInfo.pm 2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Util/DependencyInfo.pm 2006-01-10 22:39:01.666417637 +0900
@@ -168,6 +168,12 @@
desc => 'The "sa-update" script requires this module to access compressed
update archive files.',
},
+{
+ module => 'Encode::Detect',
+ version => '0.00',
+ desc => 'If this module is installed, SpamAssassin will detect charsets
+ and convert them into Unicode.',
+},
);
###########################################################################
--
----------------------------------------------------------------------
Motoharu Kubo
[EMAIL PROTECTED]