Earl Hood
Sat, 21 May 2005 12:33:26 -0700
On May 21, 2005 at 02:00, Jeff Breidenbach wrote:

> I'm seeing quite a few UTF-8 warnings on 2.6.11. Is this
> expected?

I believe so. The fix for bug #11187 activates perl's built-in UTF-8
sequence checks. It appears to be common for email tagged with utf-8
encoding to contain invalid utf-8 sequences.

I get a lot of warnings with the utf-8 sample message I have, which
contains (deliberately) malformed utf-8 sequences. Where the sequences
are good, no warnings are generated.

As noted in the bug's comments, I do not understand why I needed to
make the fix in the first place. My guess is that the behavior changed
between perl versions. According to the latest docs at
perldoc.perl.org, the lone 'U' template for unpack should always work.
From <http://perldoc.perl.org/perluniintro.html>:

    For UTF-8 only, you can use:

        use warnings;
        @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);

    If invalid, a Malformed UTF-8 character (byte 0x##) in unpack
    warning is produced. The "U0" means "expect strictly UTF-8 encoded
    Unicode". Without that the unpack("U*", ...) would accept also
    data like chr(0xFF), similarly to the pack as we saw earlier.

With that said, the fix is probably the better behavior, since perl
validates the sequence internally and generates a warning if the
sequence is bad.

Right now, I will not do any more research unless you (or someone
else) can provide example UTF-8 input that should not generate
malformed-sequence warnings. If you can isolate a message that does
generate the warnings, I can help you examine it to see if the
warnings are justified.

--ewh
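
For anyone trying to isolate a message as suggested above, here is a
minimal sketch of one way to test a byte string for well-formed UTF-8
outside of MHonArc. It is not the code from the bug fix and it does not
use the unpack("U0U*", ...) idiom quoted above; it relies on perl's
built-in utf8::decode(), which returns false when the bytes are not
well-formed UTF-8. The helper name is_valid_utf8 and the sample strings
are made up for illustration. Note that perl's own notion of UTF-8 is
somewhat lenient, so this check may accept some sequences that a strict
Unicode validator would reject.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MHonArc): returns 1 if $bytes is
# well-formed UTF-8 according to perl's built-in utf8::decode(), else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # utf8::decode() works in place, so decode a copy
    return utf8::decode($copy) ? 1 : 0;
}

# Illustrative samples only: two well-formed strings, two malformed ones.
my @samples = (
    [ 'ASCII only',         "hello"        ],
    [ 'Euro sign',          "\xE2\x82\xAC" ],
    [ 'lone 0xFF byte',     "\xFF"         ],
    [ 'truncated sequence', "\xE2\x82"     ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-20s %s\n", $label,
        is_valid_utf8($bytes) ? 'well-formed' : 'malformed';
}
```

Running a check like this over the raw body of a suspect message (after
undoing any base64 or quoted-printable transfer encoding) should show
whether the warnings point at genuinely bad input.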