> Malformed UTF-8 character (unexpected non-continuation byte 0x73,
> immediately after start byte 0xe9) in substitution iterator at
> /usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm
> line 435, <DATA> line 1.
> Segmentation fault

I think this is an internal 'UTF-8 flag' problem. I'm not a Perl internals expert, but internally Perl sometimes stores a Latin-1 string as Latin-1 bytes and sometimes as UTF-8, and the UTF-8 flag on the string has to match whichever representation is actually in use. In this case the flag appears to be set while the bytes are still Latin-1, which makes the low-level string parser barf. The message fits that reading: 0xe9 is 'é' in Latin-1, but in UTF-8 it is the start byte of a three-byte sequence, so the 0x73 ('s') that follows is not a valid continuation byte.

It might be possible to work around this by converting explicitly to UTF-8 and perhaps setting the flag manually, though I'm not sure that would help. See the Encode docs for how to do both.

Internals experts: I find the "magic" Latin-1-ization of my strings to be a pain in the neck sometimes. No doubt it works wonders for backwards compatibility, but when I need to send UTF-8 out to external modules I often have to make sure that what I send really is UTF-8, and it is a pain to have to check each string to see whether it is Latin-1 or UTF-8. Is there a way to stop the magic?

=Ed Batutis
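
For reference, here is a minimal sketch of the explicit conversion mentioned above, using the Encode module. The sample bytes and variable names are made up for illustration, and it has not been tried against the HTML::Entities crash, so treat it as a starting point rather than a fix. Setting the flag by hand would mean Encode::_utf8_on(), which the Encode docs describe as internal and discourage; the sketch sticks to decode(), encode() and utf8::upgrade(), which keep the flag and the bytes consistent.

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# Hypothetical input: the raw UTF-8 bytes for "é" (0xC3 0xA9),
# as they might arrive from a file or socket.
my $bytes = "\xC3\xA9";

# decode() validates the bytes and returns a character string whose
# UTF-8 flag matches its internal representation. FB_CROAK makes it
# die on malformed input instead of passing it along.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK());

# A Latin-1 string holding the same character as a single byte (0xE9).
my $latin1 = "\xE9";

# utf8::upgrade() switches the internal storage to UTF-8 and sets the
# flag, without changing the characters the string contains.
utf8::upgrade($latin1);

# encode() yields real UTF-8 bytes (flag off) to hand to external code
# that expects encoded octets.
my $out = encode('UTF-8', $latin1);

printf "decoded: %d char, flag %s\n",
    length($chars), is_utf8($chars) ? "on" : "off";
printf "encoded: %d bytes, flag %s\n",
    length($out), is_utf8($out) ? "on" : "off";
```

Whether the external module wants the character string ($chars) or the encoded octets ($out) depends on the module, so check its documentation before picking one.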