Daniel Quinlan wrote:
> What do you estimate the overhead would be?
Hard to estimate without settling some design choices, such as when
exactly to run the charset detector. The detector runs roughly 10-20
state machines over the text in parallel. The conversion itself is
another pass over the text and another copy of it. When the text
contains characters outside iso-8859-1, one then has to pay the cost of
Perl's Unicode regex support for each of the rules, so the total cost
will depend on the percentage of non-iso-8859-1 messages in the message
stream.
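To illustrate where that per-rule cost comes from (a hedged sketch, not
SpamAssassin code; the sample bytes are made up): once the detected
charset is used to decode the message, any non-iso-8859-1 content leaves
Perl's internal UTF-8 flag set on the string, and every regex matched
against it then takes the Unicode path.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "café" as UTF-8 octets, as it might arrive in a message.
my $octets = "caf\xC3\xA9";

# The conversion is the extra pass/copy mentioned above.
my $text = decode('UTF-8', $octets);

# The decoded string carries the UTF-8 flag, so each rule regex matched
# against it goes through Perl's Unicode regex support.
print utf8::is_utf8($text) ? "utf8 flag set\n" : "byte string\n";
```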
> What is the license of [Mozilla's universal charset detector]?
MPL.
We can probably safely raise the required version of HTML::Parser in our
next major revision. Making the new code conditional on the installed
version is also okay.
Without the second pack call, any non-iso-8859-1 character entities will
cause the output string to have its UTF-8 flag set and thus engage
Perl's Unicode regex support.
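To make the flag behavior concrete (a hedged sketch, not the actual
patch: the literal below stands in for what decoding an entity such as
&#8217; produces, and Encode::encode_utf8 is used here as the
downgrading step that the extra pack call performs):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Stand-in for the result of decoding a non-iso-8859-1 entity such as
# &#8217; (U+2019, right single quote): a string with the UTF-8 flag on.
my $text = "it\x{2019}s";
print "before: ", (utf8::is_utf8($text) ? "utf8" : "bytes"), "\n";

# Downgrading to octets turns the flag off, so later regexes stay on
# the cheaper byte-string path.
$text = encode_utf8($text);
print "after: ",  (utf8::is_utf8($text) ? "utf8" : "bytes"), "\n";
```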
We should pay special attention to behaving as MUAs do. I believe some
MUAs will actually ignore the MIME character set and use the one
specified in the message HTML (if it is HTML). We shouldn't necessarily
assume all MUAs have been configured to use the local character set at
all times.
Something to look into. One would have to pre-parse the HTML to see if
there is a charset label.
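A naive sketch of that pre-parse (an assumption on my part — a real
implementation would presumably use HTML::Parser rather than a regex,
and would need to handle more meta variants than this):

```perl
use strict;
use warnings;

# Look for a charset label in the HTML, e.g.
#   <meta http-equiv="Content-Type" content="text/html; charset=...">
# or the HTML5-style <meta charset="...">. Returns undef if none found.
sub html_charset {
    my ($html) = @_;
    if ($html =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }
    return undef;
}

my $html = q{<html><head>
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head><body>...</body></html>};
print html_charset($html), "\n";   # prints "windows-1251"
```

If the label is found, it would take precedence over (or at least be
weighed against) the MIME character set, matching the MUA behavior
described above.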