Re: Preliminary design proposal for charset normalization support in SpamAssassin

Justin Mason Tue, 23 Aug 2005 09:44:51 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


So -- fundamentally, I think I'm:

+1 on the idea of charset normalization to a UTF-8 form, at least for the
headers and one rendering of the message body.  I think that should be the
"body" rendering. I also think that the "rawbody" and "full" renderings
should remain un-normalized, as should the "Header:raw" pseudo-header
selector.

By doing that, we allow "words and phrases" rules to use the normalized
charset format of the body, but we allow the "structural" rules to see
the un-normalized format and possibly catch encoding tricks.

(Note that this means
Mail::SpamAssassin::Message::get_decoded_body_text_array(), which feeds
the "rawbody" rendering, doesn't need to do normalization.)


+1 on normalization being optional.  I haven't yet decided whether it should
default to on or off: always-on makes writing rules easier, but may be too
slow for many sites.  In my opinion we should benchmark to determine that.
Implementing the normalization hooks in a plugin takes care of the optional
part, since it can be enabled or disabled by uncommenting or commenting out
the loadplugin line.
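
To make the "optional via plugin" idea concrete, here is a rough sketch of
a single-callback hook.  It is written in Python purely for illustration
(the real thing would be Perl, inside SpamAssassin's plugin machinery), and
HookRegistry, register and normalize are made-up names, not existing
SpamAssassin APIs.  The point is only that with no plugin loaded the text
passes through untouched, and that a second normalizer is refused.

```python
from typing import Callable, Optional, Union

# Hypothetical hook signature: raw bytes plus an optional MIME charset
# label in, decoded text out.
Normalizer = Callable[[bytes, Optional[str]], str]


class HookRegistry:
    """Holds at most one charset-normalization callback (illustrative only)."""

    def __init__(self) -> None:
        self._normalizer: Optional[Normalizer] = None

    def register(self, fn: Normalizer) -> None:
        # Only one loaded plugin should implement the callback.
        if self._normalizer is not None:
            raise RuntimeError("a charset normalizer is already registered")
        self._normalizer = fn

    def normalize(self, data: bytes, label: Optional[str] = None) -> Union[bytes, str]:
        # No plugin loaded: return the input unchanged, i.e. current behaviour.
        if self._normalizer is None:
            return data
        return self._normalizer(data, label)
```

Commenting out the loadplugin line corresponds to never calling register(),
so sites that don't want to pay for normalization pay nothing.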


+1 on using Mozilla's universal charset detector, if that's implemented as
a CPAN module, and that module is listed as an *optional* dependency for
SpamAssassin.  In my opinion we can probably safely fall back to using the
declared charset header if the detector module is not installed, even if
that's a little more spoofable.
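
A rough sketch of that fallback order, again in Python only for
illustration.  pick_charset and normalize_to_utf8 are hypothetical names,
the detector argument stands in for whatever a wrapped charset detector
would provide, and the iso-8859-1 default is an assumption that mirrors
today's behaviour.  Preferring the detector over the label is likewise an
assumption, based on the mislabeled-Korean case in the proposal.

```python
import codecs
from typing import Callable, Optional

# Hypothetical detector interface: bytes in, charset name (or None) out.
Detector = Callable[[bytes], Optional[str]]


def pick_charset(data: bytes, declared: Optional[str],
                 detector: Optional[Detector] = None,
                 default: str = "iso-8859-1") -> str:
    """Prefer the detector's answer, then the declared label, then a default."""
    candidates = []
    if detector is not None:          # the optional dependency is installed
        candidates.append(detector(data))
    candidates.append(declared)       # declared MIME charset as the fallback
    for name in candidates:
        if not name:
            continue
        try:
            codecs.lookup(name)       # rejects useless labels like "unknown-8bit"
        except LookupError:
            continue
        return name
    return default


def normalize_to_utf8(data: bytes, declared: Optional[str],
                      detector: Optional[Detector] = None) -> bytes:
    """Decode from the chosen charset and re-encode as UTF-8."""
    charset = pick_charset(data, declared, detector)
    return data.decode(charset, errors="replace").encode("utf-8")
```

Without a detector, a Korean message mislabeled as iso-8859-1 is still
decoded as iso-8859-1, which is the spoofable case mentioned above.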


+1 on fixing Mail::SpamAssassin::HTML as described.


Daniel's point regarding MUA behaviour, and HTML declared charsets,
is a good one, btw... that *may* indicate that we'd have to re-decode
somehow if the HTML charset disagrees with the MIME charset. :(
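
For what it's worth, detecting the disagreement is cheap.  The sketch below
(Python, illustrative only) looks for a charset in a <meta> tag near the
top of the raw HTML part and compares it with the MIME charset after
canonicalizing both names.  The function names and the 4096-byte scan
window are assumptions, and the regex is a simplification, not how
Mail::SpamAssassin::HTML or any MUA actually does it.

```python
import codecs
import re
from typing import Optional

# Simplified pattern: a charset=... token inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def html_declared_charset(raw_html: bytes) -> Optional[str]:
    """Return the charset named in a <meta> tag, if one is found early on."""
    m = _META_CHARSET.search(raw_html[:4096])
    return m.group(1).decode("ascii") if m else None


def canonical(name: Optional[str]) -> Optional[str]:
    """Map charset aliases to one canonical name; None if unknown."""
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def needs_redecode(raw_html: bytes, mime_charset: Optional[str]) -> bool:
    """True when the HTML declares a known charset that differs from MIME's."""
    html_cs = canonical(html_declared_charset(raw_html))
    return html_cs is not None and html_cs != canonical(mime_charset)
```

Which of the two charsets should win when they differ is a separate
question that this sketch does not answer.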


So pretty much entirely +1 with only a couple of caveats. ;)

(It might be worth cannibalising some code from Matt's "spamassassin3"
work from that link I sent yesterday, if that's still useful btw.)

- --j.


John Gardiner Myers writes:
> The following is a preliminary proposal for how to add support for
> normalization of charsets into Perl's Unicode support.  The primary
> reason I want to do this work is to improve the ability of
> SpamAssassin to discriminate between Japanese ham and Japanese spam.
> 
> SpamAssassin currently ignores charset information, effectively
> assuming all mail is in iso-8859-1.  This works for users whose ham is
> encoded in iso-8859-1 and mostly works for users whose ham is encoded
> in other single-byte charsets.  For East Asian languages, this is
> insufficient for doing text analysis.
> 
> Since a large number of SpamAssassin users are likely to be
> uninterested in East Asian ham and thus unlikely to want to pay the
> cost of charset normalization, the normalization support needs to be
> optional, defaulting to off.
> 
> Some messages contain unlabeled charsets, others use MIME charset
> labels.  Some MIME charset labels are not useful
> (e.g. "unknown-8bit").  To handle such unlabeled data, it is
> necessary to run a charset detector over the text in order to
> determine what to convert it from.  Encode::Guess effectively requires
> the caller to specify the language of the text, so I consider it too
> simplistic.  Better would be Mozilla's universal charset detector,
> which I would have to wrap up as a CPAN module.
> 
> It is common for Korean messages to have an incorrect MIME label of
> "iso-8859-1", so it may be necessary to run a charset detector even
> over MIME-labeled charsets.
> 
> After the charset has been determined, either from the MIME label or
> the charset detector, the data needs to be converted from that charset
> to Perl's internal utf8 form.  Encode::decode() is the obvious choice
> for this, though I can see reasons why an installation might want to
> be able to replace the charset converters with some other
> implementation.
> 
> The following functions, immediately after they call
> Mail::SpamAssassin::Message::Node::decode, need to call a
> function that does charset normalization.
> 
> * Mail::SpamAssassin::Message::get_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_decoded_body_text_array
> 
> Furthermore:
> 
> * Mail::SpamAssassin::Message::Node::_decode_header
> * Mail::SpamAssassin::Message::Node::__decode_header
> 
> also need to call a function to do charset normalization:
> _decode_header for unlabeled charset data, __decode_header for
> MIME encoded-words.
> 
> This new charset normalization function will take as arguments the
> text and any MIME charset label.  The function calls the charset
> detector and converter as necessary and returns the normalized text in
> Perl's internal form.  The returned text will only have the utf8 flag
> set if the input charset was not us-ascii or iso-8859-1. 
> 
> This new charset normalization function should most likely use a
> plugin callback to do all the work, though it only makes sense for one
> loaded plugin to implement the callback.  If no plugin implements the
> callback, then it should simply return the input text, preserving the
> current behavior.
> 
> The other issue is that Mail::SpamAssassin::HTML uses two calls to
> pack("C0A*", ...) in order to strip Perl's utf-8 flag from text going
> into and out of HTML::Parser.  When doing charset normalization, these
> two pack calls need to be removed.  In order for HTML::Parser to
> correctly handle utf8, one needs minimum versions of Perl 5.8 and
> HTML::Parser 3.39_90.  HTML::Parser 3.43 might be a better minimum
> version--I haven't reviewed the severity of the utf8 bug fixed in that
> release.  I see two possibilities:
> 
> 1) Condition the two pack calls on version checks: (perl < 5.8 ||
>    HTML::Parser < 3.43)
> 
> 2) Condition the two pack calls on charset normalization disabled.
> 
> Comments?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDC1G4MJF5cimLx9ARAl1dAJ9Zv6nQmTdCfKysyJ+kDh2RYUmLgACfRg0A
cDorvcXWU8cUzOlrl/BDX1M=
=xKpM
-----END PGP SIGNATURE-----
