Motoharu Kubo writes:
> Thank you very much for your help.
>
> Amazingly, the Bayes score for ham decreased drastically with the patch I
> made yesterday. I tested the same mail text on the old system and the new
> system. The old system returns BAYES_99, while the new system returns BAYES_00!!
>
> Although my patch is still incomplete and the bayes db on the new system is
> a mixture of old-style and new-style tokens, it is an excellent result.
>
> Today I changed:
>
> o Made a splitter function to separate the Kakasi processing, and moved the
> routine from Message/Node.pm to Message.pm. This will make it easier to
> replace Kakasi with another program. The function simply returns if the
> text contains no UTF-8 data, so the loss of performance will be minimized
> for single-byte charsets.
>
> splitter is called from:
> get_rendered_body_text_array()
> get_visible_rendered_body_text_array()
Would it be possible to move this to Bayes.pm? As noted, it's
Bayes-specific, and this is a more appropriate place.
> o Bayes tokenization for long tokens. The original code cuts a token every
> two bytes from its start. As a multibyte UTF-8 character in Japanese text
> has at least 3 bytes, I modified it to cut at every UTF-8 character.
>
> I am not sure whether this change is appropriate.
It may be better to entirely disable the feature that cuts 8-bit strings
into 2-byte pairs, if Kakasi is in use, since it was intended as a
low-cost way of generating approximate-tokenized word tokens for Asian
character sets, and Kakasi does that task more effectively.
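
To make the difference concrete, here is a minimal sketch (not the
SpamAssassin code itself) comparing the 2-byte pair cut with a cut at UTF-8
character boundaries. The sub names and the $segmenter_in_use switch are
hypothetical, made up for illustration, and the token is assumed to hold raw
bytes as it would under "use bytes":

```perl
use strict;
use warnings;

# Existing approach: cut a long 8-bit token into 2-byte pairs.
sub byte_pair_tokens {
    my ($token) = @_;
    my @out;
    while ($token =~ s/^(..?)//) {
        push @out, "8:$1";
    }
    return @out;
}

# Alternative: cut at UTF-8 character boundaries. A lead byte
# (0xC2-0xF4) is followed by one to three continuation bytes.
sub utf8_char_tokens {
    my ($token) = @_;
    my @out;
    while ($token =~ /\G([\xc2-\xf4][\x80-\xbf]{1,3})/gc) {
        push @out, "8:$1";
    }
    return @out;
}

# Hypothetical switch: produce no tuples at all when a word segmenter
# such as Kakasi has already split the text into words.
sub long_8bit_tokens {
    my ($token, $segmenter_in_use) = @_;
    return () if $segmenter_in_use;
    return utf8_char_tokens($token);
}

# UTF-8 bytes of two Japanese characters (U+65E5, U+672C).
my $token = "\xe6\x97\xa5\xe6\x9c\xac";
my @pairs = byte_pair_tokens($token);
my @chars = utf8_char_tokens($token);
printf "%d byte pairs, %d characters\n", scalar(@pairs), scalar(@chars);
```

The pair cut splits each 3-byte character across token boundaries, which is
why the resulting tokens are only approximate.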
--j.
> I attached my newest patch.
>
> > The patch you include below includes most of my change, but omits the
> > following hunk. Perhaps the lack of that change is your problem?
> >
> > @@ -385,7 +411,7 @@
> > }
> > else {
> > $self->{rendered_type} = $self->{type};
> > - $self->{rendered} = $text;
> > + $self->{rendered} = $self->{visible_rendered} = $text;
> > }
> > }
>
> My mistake. I didn't look at svn. I included this hunk and deleted my
> modification. It works fine.
>
> > The problem here is the "use bytes" pragma at the top of
> > Bayes.pm--you'll want to remove that. Removing it will have some
> > follow-on consequences--the "use bytes" pragma will probably also have
> > to be removed from BayesStore and the other Bayes-related modules. The
> > BayesStore subclasses probably will also have to be modified to become
> > UTF-8 aware, storing tokens in UTF-8 form.
>
> I did not change this because I think speed is another important factor for
> a mail filter.
>
> I inserted a check for whether the data contains UTF-8 characters, but it may
> not be accurate. s/([\x20-\x7f])\xa0+([\x20-\x7f])/$1$2/g would be more
> accurate when using the "use bytes" pragma.
>
> Motoharu Kubo
> [EMAIL PROTECTED]
> part 2 text/x-patch 8846
> diff -uNr SpamAssassin.orig/Bayes.pm SpamAssassin/Bayes.pm
> --- SpamAssassin.orig/Bayes.pm 2005-08-12 09:38:47.000000000 +0900
> +++ SpamAssassin/Bayes.pm 2006-01-11 21:04:36.555264391 +0900
> @@ -345,7 +345,7 @@
> # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
> # and ISO-8859-15 alphas. Do not split on @'s; better results keeping it.
> # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
> - tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\241-\377 / /cs;
> + tr/-A-Za-z0-9,[EMAIL PROTECTED]'"\$.\200-\377 / /cs;
>
> # DO split on "..." or "--" or "---"; common formatting error resulting in
> # hapaxes. Keep the separator itself as a token, though, as long ones can
> @@ -411,11 +411,11 @@
> # the domain ".net" appeared in the To header.
> #
> if ($len > MAX_TOKEN_LENGTH && $token !~ /\*/) {
> - if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
> + if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xc0-\xff][\x80-\xbf]{2,}/) {
> # Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
> # but I'm doing tuples to keep the dbs small(er)." Sounds like a plan
> # to me! (jm)
> - while ($token =~ s/^(..?)//) {
> + while ($token =~ s/^([\xc0-\xff][\x80-\xbf]{2,})//) {
> push (@rettokens, "8:$1");
> }
> next;
> diff -uNr SpamAssassin.orig/HTML.pm SpamAssassin/HTML.pm
> --- SpamAssassin.orig/HTML.pm 2005-08-12 09:38:47.000000000 +0900
> +++ SpamAssassin/HTML.pm 2006-01-10 22:45:26.000000000 +0900
> @@ -742,7 +742,12 @@
> }
> }
> else {
> - $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
> + if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> + $text =~ s/[ \t\n\r\f\x0b]+/ /g;
> + }
> + else {
> + $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
> + }
> # trim leading whitespace if previous element was whitespace
> if (@{ $self->{text} } &&
> defined $self->{text_whitespace} &&
> diff -uNr SpamAssassin.orig/Message/Node.pm SpamAssassin/Message/Node.pm
> --- SpamAssassin.orig/Message/Node.pm 2005-08-12 09:38:46.000000000 +0900
> +++ SpamAssassin/Message/Node.pm 2006-01-11 21:08:33.547919446 +0900
> @@ -42,6 +42,8 @@
> use Mail::SpamAssassin::HTML;
> use Mail::SpamAssassin::Logger;
>
> +our $normalize_supported = ( $] > 5.008004 && eval 'require Encode::Detect::Detector' && eval 'require Encode' );
> +
> =item new()
>
> Generates an empty Node object and returns it. Typically only called
> @@ -342,6 +344,28 @@
> return 0;
> }
>
> +sub _normalize {
> + my ($data, $charset) = @_;
> + return $data unless $normalize_supported;
> + my $detected = Encode::Detect::Detector::detect($data);
> + dbg("Detected charset ".($detected || 'none'));
> +
> + my $converter;
> +
> + if ($charset && ($detected || 'none') !~ /^(?:UTF|EUC|ISO-2022|Shift_JIS|Big5|GB)/i) {
> + dbg("Using labeled charset $charset");
> + $converter = Encode::find_encoding($charset);
> + }
> +
> + $converter = Encode::find_encoding($detected) unless $converter || !defined($detected);
> +
> + return $data unless $converter;
> +
> + dbg("Converting...");
> +
> + return $converter->decode($data, 0);
> +}
> +
> =item rendered()
>
> render_text() takes the given text/* type MIME part, and attempts to
> @@ -359,7 +383,7 @@
> return(undef,undef) unless ( $self->{'type'} =~ /^text\b/i );
>
> if (!exists $self->{rendered}) {
> - my $text = $self->decode();
> + my $text = _normalize($self->decode(), $self->{charset});
> my $raw = length($text);
>
> # render text/html always, or any other text|text/plain part as text/html
> @@ -386,7 +410,7 @@
> }
> else {
> $self->{rendered_type} = $self->{type};
> - $self->{rendered} = $text;
> + $self->{rendered} = $self->{visible_rendered} = $text;
> }
> }
>
> @@ -478,7 +502,7 @@
>
> if ( $cte eq 'B' ) {
> # base 64 encoded
> - return Mail::SpamAssassin::Util::base64_decode($data);
> + $data = Mail::SpamAssassin::Util::base64_decode($data);
> }
> elsif ( $cte eq 'Q' ) {
> # quoted printable
> @@ -486,12 +510,13 @@
> # the RFC states that in the encoded text, "_" is equal to "=20"
> $data =~ s/_/=20/g;
>
> - return Mail::SpamAssassin::Util::qp_decode($data);
> + $data = Mail::SpamAssassin::Util::qp_decode($data);
> }
> else {
> # not possible since the input has already been limited to 'B' and 'Q'
> die "message: unknown encoding type '$cte' in RFC2047 header";
> }
> + return _normalize($data, $encoding);
> }
>
> # Decode base64 and quoted-printable in headers according to RFC2047.
> @@ -505,15 +530,15 @@
> $header =~ s/\n[ \t]+/\n /g;
> $header =~ s/\r?\n//g;
>
> - return $header unless $header =~ /=\?/;
> -
> # multiple encoded sections must ignore the interim whitespace.
> # to avoid possible FPs with (\s+(?==\?))?, look for the whole RE
> # separated by whitespace.
> 1 while ($header =~
> s/(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)\s+(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)/$1$2/g);
>
> - $header =~
> - s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge;
> + unless ($header =~
> + s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge) {
> + $header = _normalize($header);
> + }
>
> return $header;
> }
> diff -uNr SpamAssassin.orig/Message.pm SpamAssassin/Message.pm
> --- SpamAssassin.orig/Message.pm 2005-09-14 11:07:31.000000000 +0900
> +++ SpamAssassin/Message.pm 2006-01-11 21:07:15.045589574 +0900
> @@ -760,6 +760,7 @@
> # 0: content-type, 1: boundary, 2: charset, 3: filename
> my @ct = Mail::SpamAssassin::Util::parse_content_type($part_msg->header('content-type'));
> $part_msg->{'type'} = $ct[0];
> + $part_msg->{'charset'} = $ct[2];
>
> # multipart sections are required to have a boundary set ... If this
> # one doesn't, assume it's malformed and revert to text/plain
> @@ -871,12 +872,17 @@
>
> # whitespace handling (warning: small changes have large effects!)
> $text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
> - $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> + if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> + $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
> + }
> + else {
> + $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> + }
> $text =~ tr/\f/\n/; # form feeds => newline
>
> # warn "message: $text";
>
> - my @textary = split_into_array_of_short_lines ($text);
> + my @textary = split_into_array_of_short_lines (splitter($text));
> $self->{text_rendered} = \@textary;
>
> return $self->{text_rendered};
> @@ -931,10 +937,15 @@
>
> # whitespace handling (warning: small changes have large effects!)
> $text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
> - $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> + if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> + $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
> + }
> + else {
> + $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> + }
> $text =~ tr/\f/\n/; # form feeds => newline
>
> - my @textary = split_into_array_of_short_lines ($text);
> + my @textary = split_into_array_of_short_lines (splitter($text));
> $self->{text_visible_rendered} = \@textary;
>
> return $self->{text_visible_rendered};
> @@ -982,7 +993,13 @@
>
> # whitespace handling (warning: small changes have large effects!)
> $text =~ s/\n+\s*\n+/\f/gs; # double newlines => form feed
> - $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> + if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
> + $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
> + }
> + else {
> + $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace => space
> + }
> + $text =~ tr/ \t\n\r\x0b/ /s; # whitespace => space
> $text =~ tr/\f/\n/; # form feeds => newline
>
> my @textary = split_into_array_of_short_lines ($text);
> @@ -1028,6 +1045,25 @@
>
> # ---------------------------------------------------------------------------
>
> +sub splitter {
> + my ($text) = @_;
> +
> + if ( $text !~ /[\xc0-\xff][\x80-\xbf]{2,}/ ) { return $text; }
> +
> + $text =~ s/([\xc0-\xff][\x80-\xbf]{2,})[ \n]+([\xc0-\xff][\x80-\xbf]{2,})/$1$2/gs;
> +
> + use Text::Kakasi;
> + Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
> +
> + my $res = Encode::encode("euc-jp",Encode::decode("utf8",$text));
> + my $str = Text::Kakasi::do_kakasi($res);
> + my $utf8= Encode::decode("euc-jp",$str);
> +
> + return $utf8;
> +}
> +
> +# ---------------------------------------------------------------------------
> +
> 1;
>
> =back
> diff -uNr SpamAssassin.orig/Util/DependencyInfo.pm SpamAssassin/Util/DependencyInfo.pm
> --- SpamAssassin.orig/Util/DependencyInfo.pm 2005-09-14 11:07:31.000000000 +0900
> +++ SpamAssassin/Util/DependencyInfo.pm 2006-01-10 22:45:26.000000000 +0900
> @@ -168,6 +168,12 @@
> desc => 'The "sa-update" script requires this module to access compressed
> update archive files.',
> },
> +{
> + module => 'Encode::Detect',
> + version => '0.00',
> + desc => 'If this module is installed, SpamAssassin will detect charsets
> + and convert them into Unicode.',
> +},
> );
>
> ###########################################################################
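
A note on the whitespace hunks in the patch above: they guard \xa0 because
that byte is a non-breaking space in Latin-1 but also a legal continuation
byte in UTF-8, so translating it to a space can cut a multibyte character in
half. A minimal sketch of the effect follows. The sub name is hypothetical,
the guard regex is copied from the patch, and the strings are assumed to be
raw bytes:

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the guard in the patch: only treat \xa0
# as whitespace when the bytes do not look like multibyte UTF-8.
sub squeeze_whitespace {
    my ($text) = @_;
    if ($text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/) {
        $text =~ tr/ \t\n\r\x0b/ /s;        # leave \xa0 alone
    }
    else {
        $text =~ tr/ \t\n\r\x0b\xa0/ /s;    # Latin-1 NBSP => space
    }
    return $text;
}

# U+3060 encodes to the UTF-8 bytes E3 81 A0, so it ends in \xa0.
my $utf8   = "\xe3\x81\xa0  x";
my $latin1 = "a\xa0\xa0b";

# Unguarded translation, as in the code before the patch.
(my $naive = $utf8) =~ tr/ \t\n\r\x0b\xa0/ /s;

printf "naive: %d bytes, guarded: %d bytes\n",
    length($naive), length(squeeze_whitespace($utf8));
```

Note that the three-byte guard does not fire on text that contains only
two-byte UTF-8 sequences, which fits the remark in the quoted mail that the
check may not be accurate.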