Re: Charset normalization issue (report, patch, and request)

Justin Mason Wed, 11 Jan 2006 09:56:19 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Motoharu Kubo writes:
> Thank you very much for your help.
> 
> Amazingly, the Bayes score for ham decreased drastically with yesterday's
> patch.  I tested the same mail text on the old system and the new system.
> The old system returns BAYES_99, while the new system returns BAYES_00!!
> 
> Although my patch is still incomplete and the Bayes db on the new system is
> a mixture of old-style and new-style tokens, it is an excellent result.
> 
> Today I changed:
> 
> o Made a splitter function to separate out the Kakasi processing, and moved
>   the routine from Message/Node.pm to Message.pm.  This will make it easier
>   to replace Kakasi with another program.  The function simply returns if
>   the text contains no UTF-8 data, so the performance loss will be minimized
>   for single-byte charsets.
> 
>   splitter is called from:
>      get_rendered_body_text_array()
>      get_visible_rendered_body_text_array()

Would it be possible to move this to Bayes.pm?   As noted, it's
Bayes-specific, so Bayes.pm is a more appropriate place.

> o Bayes tokenization for long tokens.  The original code cuts the token
>   into two-byte pieces from the start.  As a Japanese character in UTF-8
>   typically takes 3 bytes, I modified it to cut at every UTF-8 character.
> 
>   I am not sure whether this change is appropriate.

It may be better to entirely disable the feature that cuts 8-bit strings
into 2-byte pairs, if Kakasi is in use, since it was intended as a
low-cost way of generating approximate-tokenized word tokens for Asian
character sets, and Kakasi does that task more effectively.
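
To make the difference concrete, here is a minimal Perl sketch comparing
the two ways of cutting a long 8-bit token discussed above.  The helper
names tuples_by_byte_pairs and tuples_by_utf8_char are hypothetical, not
SpamAssassin functions; the two regexes are copied from the Bayes.pm hunk
in the attached patch, and the sample token is assumed to be raw UTF-8
octets.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of SpamAssassin.

# Original behaviour: cut the byte string into pieces of up to two bytes.
sub tuples_by_byte_pairs {
  my ($token) = @_;
  my @out;
  while ($token =~ s/^(..?)//) {
    push @out, "8:$1";
  }
  return @out;
}

# Patched behaviour: cut at a UTF-8 lead byte followed by two or more
# continuation bytes.
sub tuples_by_utf8_char {
  my ($token) = @_;
  my @out;
  while ($token =~ s/^([\xc0-\xff][\x80-\xbf]{2,})//) {
    push @out, "8:$1";
  }
  return @out;
}

# Two three-byte characters (U+65E5 U+672C) as raw UTF-8 octets.
my $token = "\xe6\x97\xa5\xe6\x9c\xac";
my @pairs = tuples_by_byte_pairs($token);
my @chars = tuples_by_utf8_char($token);

printf "%d byte-pair tokens, %d tokens from the patched regex\n",
  scalar(@pairs), scalar(@chars);
```

The byte-pair version yields three tokens, the middle one straddling the
character boundary; the patched regex yields one token per three-byte
character.  The patched regex does not match two-byte UTF-8 sequences at
all, which may or may not matter depending on the languages involved.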

- --j.

> I attached my newest patch.
> 
> > The patch you include below includes most of my change, but omits the
> > following hunk. Perhaps the lack of that change is your problem?
> > 
> > @@ -385,7 +411,7 @@
> > }
> > else {
> > $self->{rendered_type} = $self->{type};
> > - $self->{rendered} = $text;
> > + $self->{rendered} = $self->{visible_rendered} = $text;
> > }
> > }
> 
> My mistake.  I had not looked at svn.  I have included this hunk and
> removed my own modification.  It works fine.
> 
> > The problem here is the "use bytes" pragma at the top of
> > Bayes.pm--you'll want to remove that. Removing it will have some
> > follow-on consequences--the "use bytes" pragma will probably also have
> > to be removed from BayesStore and the other Bayes-related modules. The
> > BayesStore subclasses probably will also have to be modified to become
> > UTF-8 aware, storing tokens in UTF-8 form.
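
As a rough illustration of what the "use bytes" pragma changes, the sketch
below compares length() on a decoded string with and without the pragma.
It is a standalone example, not code from Bayes.pm; the variable names are
made up for illustration, and only the core Encode and bytes modules are
assumed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two three-byte characters (U+65E5 U+672C) as raw UTF-8 octets.
my $octets = "\xe6\x97\xa5\xe6\x9c\xac";

# Decode into a Perl character string (two characters).
my $chars = decode('UTF-8', $octets);

# Without the pragma, length() counts characters.
my $char_len = length($chars);

my $byte_len;
{
  use bytes;                     # lexically scoped to this block
  # With the pragma, length() counts the bytes of the internal form.
  $byte_len = length($chars);
}

# Explicitly encoded octets, e.g. what a token store could write out.
my $stored = encode('UTF-8', $chars);

print "chars=$char_len bytes=$byte_len stored=", length($stored), "\n";
```

Code that relies on byte semantics sees six units here, while code working
on characters sees two, which is why removing the pragma from one module
tends to ripple into the modules that store or compare the same tokens.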
> 
> I did not change this because I think speed is another important factor
> for a mail filter.
> 
> I inserted a check for whether the data contains UTF-8 characters, but it
> may not be accurate.  s/([\x20-\x7f])\xa0+([\x20-\x7f])/$1$2/g would be more
> accurate when the "use bytes" pragma is in effect.
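
The reason \xa0 needs special care can be shown with a short sketch.  In
ISO-8859-1 the byte \xa0 is a non-breaking space, but in UTF-8 it can be
a continuation byte inside a character.  The helper collapse_whitespace is
hypothetical and simply mirrors the guard used in the HTML.pm hunk of the
patch; the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the guard in the patch; not SpamAssassin API.
sub collapse_whitespace {
  my ($text) = @_;
  if ($text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/) {
    # Looks like multibyte UTF-8: leave \xa0 alone, since it may be a
    # continuation byte inside a character.
    $text =~ s/[ \t\n\r\f\x0b]+/ /g;
  }
  else {
    # Single-byte text: treat \xa0 (NBSP in ISO-8859-1) as whitespace.
    $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
  }
  return $text;
}

my $utf8   = "\xe3\x81\xa0";   # U+3060 as UTF-8 octets; last byte is \xa0
my $latin1 = "a\xa0\xa0b";     # ISO-8859-1 text with two NBSPs

# The unconditional rule replaces the trailing \xa0 and damages the character.
(my $naive = $utf8) =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;

print "unconditional rule corrupts the UTF-8 character\n"
  if $naive ne $utf8;
print "guarded rule leaves the UTF-8 character intact\n"
  if collapse_whitespace($utf8) eq $utf8;
print "NBSPs in single-byte text are collapsed\n"
  if collapse_whitespace($latin1) eq "a b";
```

As noted above, the guard is a heuristic: it only looks for one lead byte
followed by two continuation bytes, so it can misfire on single-byte text
that happens to contain such a byte run.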
> 
> Motoharu Kubo
> [EMAIL PROTECTED]
> part 2     text/x-patch              8846
> diff -uNr SpamAssassin.orig/Bayes.pm SpamAssassin/Bayes.pm
> --- SpamAssassin.orig/Bayes.pm        2005-08-12 09:38:47.000000000 +0900
> +++ SpamAssassin/Bayes.pm     2006-01-11 21:04:36.555264391 +0900
> @@ -345,7 +345,7 @@
>    # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
>    # and ISO-8859-15 alphas.  Do not split on @'s; better results keeping it.
>    # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
> -  tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\241-\377 / /cs;
> +  tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\200-\377 / /cs;
>  
>    # DO split on "..." or "--" or "---"; common formatting error resulting in
>    # hapaxes.  Keep the separator itself as a token, though, as long ones can
> @@ -411,11 +411,11 @@
>      # the domain ".net" appeared in the To header.
>      #
>      if ($len > MAX_TOKEN_LENGTH && $token !~ /\*/) {
> -      if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
> +      if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xc0-\xff][\x80-\xbf]{2,}/) {
>       # Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
>       # but I'm doing tuples to keep the dbs small(er)."  Sounds like a plan
>       # to me! (jm)
> -     while ($token =~ s/^(..?)//) {
> +     while ($token =~ s/^([\xc0-\xff][\x80-\xbf]{2,})//) {
>         push (@rettokens, "8:$1");
>       }
>       next;
> diff -uNr SpamAssassin.orig/HTML.pm SpamAssassin/HTML.pm
> --- SpamAssassin.orig/HTML.pm 2005-08-12 09:38:47.000000000 +0900
> +++ SpamAssassin/HTML.pm      2006-01-10 22:45:26.000000000 +0900
> @@ -742,7 +742,12 @@
>      }
>    }
>    else {
> -    $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
> +    if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> +      $text =~ s/[ \t\n\r\f\x0b]+/ /g;
> +    }
> +    else {
> +      $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
> +    }
>      # trim leading whitespace if previous element was whitespace
>      if (@{ $self->{text} } &&
>       defined $self->{text_whitespace} &&
> diff -uNr SpamAssassin.orig/Message/Node.pm SpamAssassin/Message/Node.pm
> --- SpamAssassin.orig/Message/Node.pm 2005-08-12 09:38:46.000000000 +0900
> +++ SpamAssassin/Message/Node.pm      2006-01-11 21:08:33.547919446 +0900
> @@ -42,6 +42,8 @@
>  use Mail::SpamAssassin::HTML;
>  use Mail::SpamAssassin::Logger;
>  
> +our $normalize_supported = ( $] > 5.008004 && eval 'require Encode::Detect::Detector' && eval 'require Encode' );
> +
>  =item new()
>  
>  Generates an empty Node object and returns it.  Typically only called
> @@ -342,6 +344,28 @@
>    return 0;
>  }
>  
> +sub _normalize {
> +  my ($data, $charset) = @_;
> +  return $data unless $normalize_supported;
> +  my $detected = Encode::Detect::Detector::detect($data);
> +  dbg("Detected charset ".($detected || 'none'));
> +
> +  my $converter;
> +
> +  if ($charset && ($detected || 'none') !~ /^(?:UTF|EUC|ISO-2022|Shift_JIS|Big5|GB)/i) {
> +      dbg("Using labeled charset $charset");
> +      $converter = Encode::find_encoding($charset);
> +  }
> +
> +  $converter = Encode::find_encoding($detected) unless $converter || !defined($detected);
> +
> +  return $data unless $converter;
> +
> +  dbg("Converting...");
> +
> +  return $converter->decode($data, 0);
> +}
> +
>  =item rendered()
>  
>  render_text() takes the given text/* type MIME part, and attempts to
> @@ -359,7 +383,7 @@
>    return(undef,undef) unless ( $self->{'type'} =~ /^text\b/i );
>  
>    if (!exists $self->{rendered}) {
> -    my $text = $self->decode();
> +    my $text = _normalize($self->decode(), $self->{charset});
>      my $raw = length($text);
>  
>      # render text/html always, or any other text|text/plain part as text/html
> @@ -386,7 +410,7 @@
>      }
>      else {
>        $self->{rendered_type} = $self->{type};
> -      $self->{rendered} = $text;
> +      $self->{rendered} = $self->{visible_rendered} = $text;
>      }
>    }
>  
> @@ -478,7 +502,7 @@
>  
>    if ( $cte eq 'B' ) {
>      # base 64 encoded
> -    return Mail::SpamAssassin::Util::base64_decode($data);
> +    $data = Mail::SpamAssassin::Util::base64_decode($data);
>    }
>    elsif ( $cte eq 'Q' ) {
>      # quoted printable
> @@ -486,12 +510,13 @@
>      # the RFC states that in the encoded text, "_" is equal to "=20"
>      $data =~ s/_/=20/g;
>  
> -    return Mail::SpamAssassin::Util::qp_decode($data);
> +    $data = Mail::SpamAssassin::Util::qp_decode($data);
>    }
>    else {
>      # not possible since the input has already been limited to 'B' and 'Q'
>      die "message: unknown encoding type '$cte' in RFC2047 header";
>    }
> +  return _normalize($data, $encoding);
>  }
>  
>  # Decode base64 and quoted-printable in headers according to RFC2047.
> @@ -505,15 +530,15 @@
>    $header =~ s/\n[ \t]+/\n /g;
>    $header =~ s/\r?\n//g;
>  
> -  return $header unless $header =~ /=\?/;
> -
>    # multiple encoded sections must ignore the interim whitespace.
>    # to avoid possible FPs with (\s+(?==\?))?, look for the whole RE
>    # separated by whitespace.
>    1 while ($header =~ s/(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)\s+(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)/$1$2/g);
>  
> -  $header =~
> -    s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge;
> +  unless ($header =~
> +       s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge) {
> +    $header = _normalize($header);
> +  }
>  
>    return $header;
>  }
> diff -uNr SpamAssassin.orig/Message.pm SpamAssassin/Message.pm
> --- SpamAssassin.orig/Message.pm      2005-09-14 11:07:31.000000000 +0900
> +++ SpamAssassin/Message.pm   2006-01-11 21:07:15.045589574 +0900
> @@ -760,6 +760,7 @@
>    # 0: content-type, 1: boundary, 2: charset, 3: filename
>    my @ct = Mail::SpamAssassin::Util::parse_content_type($part_msg->header('content-type'));
>    $part_msg->{'type'} = $ct[0];
> +  $part_msg->{'charset'} = $ct[2];
>  
>    # multipart sections are required to have a boundary set ...  If this
>    # one doesn't, assume it's malformed and revert to text/plain
> @@ -871,12 +872,17 @@
>  
>    # whitespace handling (warning: small changes have large effects!)
>    $text =~ s/\n+\s*\n+/\f/gs;                # double newlines => form feed
> -  $text =~ tr/ \t\n\r\x0b\xa0/ /s;   # whitespace => space
> +  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> +    $text =~ tr/ \t\n\r\x0b/ /s;     # whitespace => space
> +  }
> +  else {
> +    $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> +  }
>    $text =~ tr/\f/\n/;                        # form feeds => newline
>    
>    # warn "message: $text";
>  
> -  my @textary = split_into_array_of_short_lines ($text);
> +  my @textary = split_into_array_of_short_lines (splitter($text));
>    $self->{text_rendered} = [EMAIL PROTECTED];
>  
>    return $self->{text_rendered};
> @@ -931,10 +937,15 @@
>  
>    # whitespace handling (warning: small changes have large effects!)
>    $text =~ s/\n+\s*\n+/\f/gs;                # double newlines => form feed
> -  $text =~ tr/ \t\n\r\x0b\xa0/ /s;   # whitespace => space
> +  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> +    $text =~ tr/ \t\n\r\x0b/ /s;     # whitespace => space
> +  }
> +  else {
> +    $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> +  }
>    $text =~ tr/\f/\n/;                        # form feeds => newline
>  
> -  my @textary = split_into_array_of_short_lines ($text);
> +  my @textary = split_into_array_of_short_lines (splitter($text));
>    $self->{text_visible_rendered} = [EMAIL PROTECTED];
>  
>    return $self->{text_visible_rendered};
> @@ -982,7 +993,13 @@
>  
>    # whitespace handling (warning: small changes have large effects!)
>    $text =~ s/\n+\s*\n+/\f/gs;                # double newlines => form feed
> -  $text =~ tr/ \t\n\r\x0b\xa0/ /s;   # whitespace => space
> +  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> +    $text =~ tr/ \t\n\r\x0b/ /s;     # whitespace => space
> +  }
> +  else {
> +    $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> +  }
> +  $text =~ tr/ \t\n\r\x0b/ /s;       # whitespace => space
>    $text =~ tr/\f/\n/;                        # form feeds => newline
>  
>    my @textary = split_into_array_of_short_lines ($text);
> @@ -1028,6 +1045,25 @@
>  
>  # ---------------------------------------------------------------------------
>  
> +sub splitter {
> +  my ($text) = @_;
> +
> +  if ( $text !~ /[\xc0-\xff][\x80-\xbf]{2,}/ ) { return $text; }
> +
> +  $text =~ s/([\xc0-\xff][\x80-\xbf]{2,})[ \n]+([\xc0-\xff][\x80-\xbf]{2,})/$1$2/gs;
> +
> +  use Text::Kakasi;
> +  Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
> +
> +  my $res = Encode::encode("euc-jp",Encode::decode("utf8",$text));
> +  my $str = Text::Kakasi::do_kakasi($res);
> +  my $utf8= Encode::decode("euc-jp",$str);
> +
> +  return $utf8;
> +}
> +
> +# ---------------------------------------------------------------------------
> +
>  1;
>  
>  =back
> diff -uNr SpamAssassin.orig/Util/DependencyInfo.pm SpamAssassin/Util/DependencyInfo.pm
> --- SpamAssassin.orig/Util/DependencyInfo.pm  2005-09-14 11:07:31.000000000 +0900
> +++ SpamAssassin/Util/DependencyInfo.pm       2006-01-10 22:45:26.000000000 +0900
> @@ -168,6 +168,12 @@
>    desc => 'The "sa-update" script requires this module to access compressed
>    update archive files.',
>  },
> +{
> +  module => 'Encode::Detect',
> +  version => '0.00',
> +  desc => 'If this module is installed, SpamAssassin will detect charsets
> +  and convert them into Unicode.',
> +},
>  );
>  
>  ###########################################################################
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDxUafMJF5cimLx9ARAgduAJ9FwZN3Zs4c0HneoBh9Wrlptlr1FQCeIRBb
ALga6AvVU4T15EugNaAi1gQ=
=BkIK
-----END PGP SIGNATURE-----
