[Bug 4636] Charset normalization plugin support

bugzilla-daemon Tue, 18 Oct 2005 14:53:53 -0700

http://bugzilla.spamassassin.org/show_bug.cgi?id=4636

------- Additional Comments From [EMAIL PROTECTED]  2005-10-18 14:53 -------
(In reply to comment #6)
> Earlier in the ticket you were talking about header normalization.  Body
> normalization is a different beast (but it's easier to deal with imo).

They are both components of this enhancement.  The case for header
normalization is easier to make, but a similar case applies to Node::rendered(),
since charset normalization has to happen before the text is fed into
HTML::Parser.
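
To illustrate the ordering constraint, here is a minimal sketch in Python rather
than SpamAssassin's Perl.  The names TextCollector and render are made up for
this example and do not correspond to anything in Message::Node; the Latin-1
fallback is also an assumption.  The point is only that the bytes are decoded
using the declared charset first, and the parser sees text afterwards.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def render(raw_bytes, declared_charset):
    """Normalize bytes to Unicode text first, then parse the HTML."""
    try:
        text = raw_bytes.decode(declared_charset)
    except (LookupError, UnicodeDecodeError):
        # Assumption: fall back to Latin-1, which can decode any byte string.
        text = raw_bytes.decode("latin-1")
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)
```

If the decode step were skipped, the parser would be handed raw bytes whose
meaning depends on a charset it knows nothing about.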

> It's worth noting that this is actually going to be a much larger issue
> than just having a plugin, btw.  The main problem is that SpamAssassin
> very specifically disables unicode in every module via "use bytes"
> (according to the svn log it looks like it was added in at r3997 back
> in December 2002).

I got rid of a bunch of those in r315047.  I should audit the remaining ones.

> I was thinking that the plugin would be called by check_start, then
> get an array of parts via find_parts(), then do any manipulation of
> the data as required per-part (either dealing with the decoded or the
> rendered portions, or both).

Something like this would make sense if normalization/rendering were done
outside Message::Node.  It would be harder to do lazy normalization of headers
that way.
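
A rough sketch of what "lazy normalization of headers" could look like, again in
Python and not taken from the SpamAssassin code.  LazyHeader and decode_count
are hypothetical names; the only real API used is email.header.decode_header
from the Python standard library.  The header is decoded the first time
something asks for it and the result is cached.

```python
from email.header import decode_header


class LazyHeader:
    """Holds a raw header value and decodes it only on first access."""

    def __init__(self, raw_value):
        self.raw_value = raw_value
        self._normalized = None
        self.decode_count = 0  # only here to make the laziness observable

    @property
    def normalized(self):
        if self._normalized is None:
            self.decode_count += 1
            parts = []
            for value, charset in decode_header(self.raw_value):
                # Encoded words come back as bytes plus a charset name.
                if isinstance(value, bytes):
                    value = value.decode(charset or "ascii", errors="replace")
                parts.append(value)
            self._normalized = "".join(parts)
        return self._normalized
```

Headers that no rule ever looks at are never decoded, which is the property
that is harder to keep if normalization runs as a separate pass over the whole
message.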

> Potentially, there'd be a new function in Message like
> "clear_rendered_cache" or something which would delete the cached forms
> of text_rendered, text_visible_rendered, text_invisible_rendered, and
> (if necessary/different function) text_decoded.

If the cache got filled before the normalization plugin got invoked, that would
indicate a bug in the order of execution.
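
One way to make that ordering bug visible instead of silently clearing the
cache is to have normalization refuse to run once a rendered form exists.  This
is a hypothetical Python sketch; Part, normalize and rendered are illustrative
names, not the Message::Node interface.

```python
class Part:
    """Hypothetical message part with a cached rendered form."""

    def __init__(self, raw_bytes, charset):
        self.raw_bytes = raw_bytes
        self.charset = charset
        self.text = None        # set by normalize()
        self._rendered = None   # cache filled by rendered()

    def normalize(self, normalizer):
        # Normalizing after rendering means the cache holds stale data.
        if self._rendered is not None:
            raise RuntimeError("rendered cache filled before normalization")
        self.text = normalizer(self.raw_bytes, self.charset)

    def rendered(self):
        if self._rendered is None:
            if self.text is not None:
                self._rendered = self.text
            else:
                # Assumption: un-normalized parts are treated as Latin-1.
                self._rendered = self.raw_bytes.decode("latin-1")
        return self._rendered
```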

> It's not very clean from an OO perspective.  Arguably we'd always want
> to make sure the message is in utf-8 format internally, and so the code
> could just be in Message::Node.

There's no clean OO separation between data and view, but that problem already
existed: Message::Node already knows almost everything about SpamAssassin's
view of MIME entities.

There is also the question of whether charset normalization should be:

A) a plugin

B) hardcoded but enabled/disabled by config

C) hardcoded, always on for sufficiently recent versions of Perl

Once I get a charset normalizer hooked up I can get some numbers on how much it
costs.  I was operating under the assumption that it would be too expensive to
enable for everybody.
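
For what it's worth, the kind of measurement I have in mind is nothing fancy.
A sketch in Python, where normalize is a stand-in for whatever normalizer ends
up being hooked in and cost_per_message is a made-up helper name:

```python
import timeit


def normalize(raw_bytes, charset):
    """Stand-in normalizer: decode bytes to text, replacing bad sequences."""
    return raw_bytes.decode(charset, errors="replace")


def cost_per_message(raw_bytes, charset, repeats=1000):
    """Average wall-clock seconds spent normalizing one message body."""
    total = timeit.timeit(lambda: normalize(raw_bytes, charset), number=repeats)
    return total / repeats
```

Comparing that figure against the total scan time per message on a real corpus
would show whether the cost matters.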

I was thinking that having a plugin allowed people more flexibility in tuning
the normalization process, but perhaps that's not strictly necessary.




------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
