On January 9, 2006 at 22:19, Jeff Breidenbach wrote:

> When mhonarc is producing UTF-8 using the TEXTENCODE resource, does it
> ever produce invalid UTF-8? I ask because I'm taking some mhonarc
> output, stripping the HTML, then feeding the results to a Perl based text
> analysis program. Which occasionally complains bitterly, for example:
>
> Malformed UTF-8 character (unexpected continuation byte 0x85, with no
> preceding start byte)

I've made attempts to deal with malformed UTF-8, but I will have to look
into it. With TEXTENCODE, and perl >= 5.8, MHonArc utilizes the Encode
module to do the encoding, so it may be a factor. With perl < 5.8, I've
tried to deal with it as best as I know how.

Taking a quick look at the code, if the input is formally tagged as
us-ascii or utf-8, mhonarc passes the data as-is if encoding to UTF-8.
Therefore, if the source has bad sequences, then the final output will
also have them.

It may be worth considering if mhonarc should do a sanity check on the
data even if the source claims to be utf-8. There may be security
implications.

If you can provide me with a sample message, I can check it out.

--ewh
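
As a rough illustration of the kind of sanity check mentioned above, here is a minimal Perl sketch for perl >= 5.8 using the Encode module. It is not MHonArc code: the helper names is_valid_utf8 and sanitize_utf8 are hypothetical, and the sketch assumes the input is a raw byte string that merely claims to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of MHonArc): returns 1 if $bytes is
# well-formed UTF-8 according to Encode's strict "UTF-8" decoder, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on the first malformed sequence;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: decode leniently, letting Encode substitute a
# replacement character for each malformed sequence, then re-encode so
# the returned byte string is well-formed UTF-8.
sub sanitize_utf8 {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # no CHECK argument: substitute, do not die
    return encode('UTF-8', $text);
}

# A stray continuation byte, as in the reported error message.
my $suspect = "abc\x85def";
my $clean   = is_valid_utf8($suspect) ? $suspect : sanitize_utf8($suspect);
```

Whether to substitute, drop, or reject malformed input outright is a policy choice; the sketch substitutes so that downstream tools always receive decodable text.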
