On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:

> On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
>
>> Try passing the parser options as a hash reference:
>>
>>   my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});
>
> WTF! That fixes it! I don't understand why it seems to be ignoring the
> encoding set in the constructor. But I've noticed the same thing with other
> options. Seems like there's some consistency to be worked out in XML::LibXML
> options, still.

Okay, a bit more information: this was not quite it, alas.

>> In order to print Unicode text strings (as opposed to octet strings)
>> correctly to a terminal (UTF-8 or not), add the following line before
>> the first output:
>>
>>   binmode STDOUT, ':utf8';
>>
>> But note that STDOUT is global.
>
> Yes, I do this all the time. Surprisingly, I don't get warnings for this
> script, even though it is outputting multibyte characters.

This is key. If I set the binmode on STDOUT to :utf8, the bogus characters print out bogus. If I set it to :raw, they come out right after processing by both Encode and XML::LibXML (I'm assuming they're interpreted as latin-1).

So my question is this: Why isn't Encode dying when it runs into these characters? They're not valid utf-8, AFAICT. Are they somehow valid utf8 (that is, valid in Perl's internal format)? Why would they be?

I think what I need is some code to strip non-utf8 characters from a string -- even if that string has the utf8 bit switched on. I thought that Encode would do that for me, but in this case apparently not. Anyone got an example?

Thanks,

David
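
For reference, here is a minimal sketch of the kind of stripping asked about above, assuming the input is still an octet string (no utf8 flag). The sample bytes and variable names are made up for illustration and are not the data from this thread. It contrasts a strict decode, which dies on malformed input, with a decode that passes a code reference as the CHECK argument to drop malformed bytes. Note that this only catches byte sequences that are not well-formed UTF-8. Text that is well-formed but decodes to the wrong characters will pass through both calls untouched, which may be one reason Encode does not complain.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: "caf" plus a lone 0xE9 byte (Latin-1 e-acute, which is
# not a valid UTF-8 sequence on its own), then "caf" plus valid UTF-8 for U+00E9.
my $octets = "caf\xE9 caf\xC3\xA9";

# Strict decode: FB_CROAK makes decode() die on malformed input.
# decode() may modify its source argument when CHECK is set, so use a copy.
my $copy   = $octets;
my $strict = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
warn "strict decode failed: $@" unless defined $strict;

# Stripping decode: a code reference as CHECK is called once per malformed
# byte, and its return value is substituted. Returning '' drops the byte.
my $copy2 = $octets;
my $clean = decode('UTF-8', $copy2, sub { '' });

# $clean is now a character string; encode it on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
```

If the string already has the utf8 flag on, it holds characters rather than octets, so there is nothing malformed left for decode() to reject. In that case the cleanup would have to target specific unwanted characters instead, and which ones depends on how the data was mangled.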