Daniel Quinlan wrote:
> What do you estimate the overhead would be?
Hard to estimate without settling some design choices, such as when
exactly to run the charset detector. The detector runs roughly 10-20
state machines over the text in parallel. The conversion itself is
another pass over the text and another copy of it. When the text
contains characters outside iso-8859-1, one then has to pay the cost of
Perl's Unicode regex support for each of the rules, so the total cost
will depend on the percentage of non-iso-8859-1 messages in the message
stream.
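To illustrate where that per-rule cost comes from (a hedged sketch, not
SpamAssassin code; the sample bytes are made up): once the detected
charset is used to decode the message, any non-iso-8859-1 content leaves
Perl's internal UTF-8 flag set on the string, and every regex matched
against it then takes the Unicode path.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "café" as UTF-8 octets, as it might arrive in a message.
my $octets = "caf\xC3\xA9";

# The conversion is the extra pass/copy mentioned above.
my $text = decode('UTF-8', $octets);

# The decoded string carries the UTF-8 flag, so each rule regex matched
# against it goes through Perl's Unicode regex support.
print utf8::is_utf8($text) ? "utf8 flag set\n" : "byte string\n";
```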
> What is the license of [Mozilla's universal charset detector]?
MPL.
We can probably safely raise the required version of HTML::Parser in our
next major revision. Making the new code conditional on the installed
version is also okay.
Without the second pack call, any non-iso-8859-1 character entities will
cause the output string to have its UTF-8 flag set and thus engage
Perl's Unicode regex support.
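To make the flag behavior concrete (a hedged sketch, not the actual
patch: the literal below stands in for what decoding an entity such as
&#8217; produces, and Encode::encode_utf8 is used here as the
downgrading step that the extra pack call performs):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Stand-in for the result of decoding a non-iso-8859-1 entity such as
# &#8217; (U+2019, right single quote): a string with the UTF-8 flag on.
my $text = "it\x{2019}s";
print "before: ", (utf8::is_utf8($text) ? "utf8" : "bytes"), "\n";

# Downgrading to octets turns the flag off, so later regexes stay on
# the cheaper byte-string path.
$text = encode_utf8($text);
print "after: ",  (utf8::is_utf8($text) ? "utf8" : "bytes"), "\n";
```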
We should pay special attention to behaving as MUAs do. I believe some
MUAs will actually ignore the MIME character set and use the one
specified in the message HTML (if it is HTML). We shouldn't necessarily
assume all MUAs have been configured to use the local character set at
all times.
Something to look into. One would have to pre-parse the HTML to see if
there is a charset label.
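A naive sketch of that pre-parse (an assumption on my part — a real
implementation would presumably use HTML::Parser rather than a regex,
and would need to handle more meta variants than this):

```perl
use strict;
use warnings;

# Look for a charset label in the HTML, e.g.
#   <meta http-equiv="Content-Type" content="text/html; charset=...">
# or the HTML5-style <meta charset="...">. Returns undef if none found.
sub html_charset {
    my ($html) = @_;
    if ($html =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }
    return undef;
}

my $html = q{<html><head>
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head><body>...</body></html>};
print html_charset($html), "\n";   # prints "windows-1251"
```

If the label is found, it would take precedence over (or at least be
weighed against) the MIME character set, matching the MUA behavior
described above.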