On January 9, 2006 at 22:19, Jeff Breidenbach wrote:

> When mhonarc is producing UTF-8 using the TEXTENCODE resource, does it
> ever produce invalid UTF-8? I ask because I'm taking some mhonarc
> output, stripping the HTML, then feeding the results to a Perl-based text
> analysis program, which occasionally complains bitterly, for example:
> 
> Malformed UTF-8 character (unexpected continuation byte 0x85, with no
> preceding start byte)

I've made attempts to deal with malformed UTF-8, but I will have
to look into this case.  With TEXTENCODE and perl >= 5.8, MHonArc uses
the Encode module to do the encoding, so Encode may be a factor.  With
perl < 5.8, I've tried to deal with it as best I know how.

Taking a quick look at the code, if the input is formally tagged
as us-ascii or utf-8, mhonarc passes the data through as-is when
encoding to UTF-8.  Therefore, if the source has bad sequences, the
final output will also have them.  It may be worth considering
whether mhonarc should do a sanity check on the data even if the
source claims to be utf-8.  There may be security implications.
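
A sanity check of that kind could look roughly like the sketch below.
It is written in Python purely for illustration (mhonarc itself is
Perl), and the function name sanitize_utf8 is hypothetical, not
anything in mhonarc.  It assumes the policy is to replace each
malformed sequence with U+FFFD rather than reject the message.

```python
from typing import Tuple


def sanitize_utf8(data: bytes) -> Tuple[bytes, bool]:
    """Return (clean_bytes, was_valid) for data that claims to be UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding raises on any malformed sequence, such as a
        # continuation byte (0x80-0xBF) with no preceding start byte.
        data.decode("utf-8", errors="strict")
        return data, True
    except UnicodeDecodeError:
        # Replace each malformed sequence with U+FFFD, then re-encode so
        # the result is guaranteed to be well-formed UTF-8.
        cleaned = data.decode("utf-8", errors="replace").encode("utf-8")
        return cleaned, False


if __name__ == "__main__":
    # A lone 0x85 continuation byte, as in the reported error message.
    print(sanitize_utf8(b"ok \x85 text"))  # (b'ok \xef\xbf\xbd text', False)
```

Valid input is returned unchanged, so the check only costs an extra
decode pass for well-formed messages.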

If you can provide me with a sample message, I can check it out.

--ewh

