On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:

> On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
> 
>> Try passing the parser options as a hash reference:
>> 
>> my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});
> 
> WTF! That fixes it! I don't understand why it seems to be ignoring the 
> encoding set in the constructor. But I've noticed the same thing with other 
> options. Seems like there's some consistency to be worked out in XML::LibXML 
> options, still.

Okay, a bit more information: this was not quite it, alas.

>> In order to print Unicode text strings (as opposed to octet strings)
>> correctly to a terminal (UTF-8 or not), add the following line before
>> the first output:
>> 
>> binmode STDOUT, ':utf8';
>> 
>> But note that STDOUT is global.
> 
> Yes, I do this all the time. Surprisingly, I don't get warnings for this 
> script, even though it is outputting multibyte characters.

This is key. If I set the binmode on STDOUT to :utf8, the bogus characters 
still print as garbage. If I set it to :raw, they come out right after 
processing by both Encode and XML::LibXML (I'm assuming they're being 
interpreted as latin-1).

So my question is this: Why isn't Encode dying when it runs into these 
characters? They're not valid utf-8, AFAICT. Are they somehow valid utf8 (that 
is, valid in Perl's internal format)? Why would they be?
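
One possible explanation, offered as an assumption rather than something 
confirmed in this thread: if the stray bytes were already decoded as latin-1 
somewhere upstream, they are now code points in the U+0080..U+009F range, 
which are legal Unicode characters, so Encode has nothing to reject. A 
minimal sketch of the difference, using a made-up sample string:

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Case 1: raw octets with a lone 0x92 byte, which is not well-formed UTF-8.
my $octets      = "it\x92s";
my $octets_copy = $octets;    # decode() with a CHECK value may modify its input
my $strict      = eval { decode('UTF-8', $octets_copy, FB_CROAK) };
my $strict_died = defined $strict ? 0 : 1;

# Case 2: a character string that already holds code point U+0092
# (what you get if byte 0x92 was earlier read as latin-1).
my $chars       = "it\x{92}s";
my $chars_copy  = $chars;     # encode() with a CHECK value may modify its input
my $bytes       = eval { encode('UTF-8', $chars_copy, FB_CROAK) };
my $encode_died = defined $bytes ? 0 : 1;

printf "strict decode of bad octets died: %s\n", $strict_died ? 'yes' : 'no';
printf "strict encode of U+0092 died:     %s\n", $encode_died ? 'yes' : 'no';
printf "encoded bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
```

In the second case Encode simply emits the two-byte sequence for U+0092, 
which would be consistent with seeing no error at all.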

I think what I need is some code to strip non-utf8 characters from a string -- 
even if that string has the utf8 flag switched on. I thought that Encode would 
do that for me, but in this case apparently not. Anyone got an example?
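
To make the request concrete, here is a rough sketch of the two kinds of 
stripping I can imagine. Both helper names are hypothetical, and the second 
one assumes the unwanted characters are C1 control characters 
(U+0080..U+009F), which is a guess on my part:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets as UTF-8 and drop malformed sequences.
sub decode_dropping_malformed {
    my ($octets) = @_;
    # With no CHECK argument, decode() substitutes U+FFFD for malformed input.
    my $text = decode('UTF-8', $octets);
    # Remove the substitutes. Note this also removes any genuine U+FFFD.
    $text =~ s/\x{FFFD}//g;
    return $text;
}

# Hypothetical helper: remove C1 control characters from a character string.
# Assumption: these are the "bogus" characters; adjust the range as needed.
sub strip_c1_controls {
    my ($text) = @_;
    $text =~ s/[\x{80}-\x{9F}]//g;
    return $text;
}

my $clean_a = decode_dropping_malformed("it\x92s fine");   # raw octets in
my $clean_b = strip_c1_controls("it\x{92}s fine");         # characters in
```

The first helper works on undecoded octets, the second on an already decoded 
string, so which one applies depends on where the bad data enters.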

Thanks,

David

