John Stracke <[EMAIL PROTECTED]> writes:

> > HTML::Parser decodes entities with the 'dtext' argspec and leaves them
> > alone for 'text'.
> 
> I'm not specifying dtext, and &nbsp; is getting decoded.
> 
> Uh...but I might be using an old form of the interface, with different
> defaults.  My subclass's constructor just calls HTML::Parser->new().

For v2, undecoded text should still be the default.  But entities _will_
be decoded in attribute values.
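
A minimal sketch of what that means for a v2-style subclass.  The
package name MyParser and the two package variables are made up for
illustration; the callback signatures are the v2-compatible ones you
get when new() is called with no arguments:

```perl
use strict;
use warnings;

package MyParser;                  # hypothetical subclass name
use HTML::Parser ();
our @ISA = ('HTML::Parser');

# Illustration-only storage for what the callbacks receive.
our $seen_text = '';
our %seen_attr;

# v2 callback: text arrives as it appeared in the document (undecoded).
sub text {
    my ($self, $text) = @_;
    $seen_text .= $text;
}

# v2 callback: attribute values in the $attr hash are entity-decoded.
sub start {
    my ($self, $tag, $attr) = @_;
    %seen_attr = (%seen_attr, %$attr);
}

package main;

my $p = MyParser->new;             # no arguments: v2-compatible mode
$p->parse('<a title="a&amp;b">x&nbsp;y</a>');
$p->eof;

print "text:  $MyParser::seen_text\n";
print "title: $MyParser::seen_attr{title}\n";
```

If the text callback shows a decoded &nbsp; with a setup like this,
something other than the default v2 behaviour is probably in play.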

If you want UTF8 output then it should just be a matter of
transforming the data to UTF8 afterwards.  The Unicode::String module
should be usable here.
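
Something along these lines might do, assuming the decoded data is
plain latin1.  The helper name latin1_to_utf8 is made up for this
sketch; latin1() and utf8() are the Unicode::String calls:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

# Hypothetical helper: takes a latin1 string (for example text that
# has already been through entity decoding) and returns it as UTF8.
sub latin1_to_utf8 {
    my ($str) = @_;
    return latin1($str)->utf8;
}

my $utf8 = latin1_to_utf8("caf\xE9");   # "cafe" with e-acute, in latin1
print length($utf8), "\n";
```

This only covers the case where everything coming out of the parser
is latin1 to begin with.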

For perl 5.6 there will be some problems if the input to the parser is
UTF8, because you then end up with a mix of UTF8-encoded chars and
latin1 chars where entity decoding has taken place.

Regards,
Gisle
