[EMAIL PROTECTED] wrote:

I am trying to use Perl on the command line to process text files in
various ways, one of which is to decode HTML entities. As far as I can
see, the following line should work:

perl -MHTML::Entities -p -e 'decode_entities($_)' <input.txt >output.txt

It does indeed change the HTML entities, but not into the required
characters, rather into pairs of unusual characters, and the command
line returns this:

Wide character in print, <> line 1.

It seems to me it is something to do with internal character encoding
being messed up but I can't work out how to control it. The text files
processed have MacOS character encoding which is required in the
finished file, but perhaps I need to convert to UTF8 before processing
and back again after?

(I am seriously new to this - only started looking at Perl yesterday!)

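HTML entities stand for Unicode characters, a set of many thousands of
different characters that cannot all be encoded in a single data byte.
decode_entities() returns Perl character strings. When a string that
contains a character above code point 255 is printed to a filehandle
with no encoding layer, Perl writes it as UTF-8 and warns "Wide
character in print". UTF-8 corresponds to ASCII for the first 128
characters; beyond that each character uses two or more data bytes,
which is why a program expecting MacOS encoding shows pairs of unusual
characters.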

Rob

