Hi Leon
> can anyone tell me how to go about dealing with weird
> characters in XML files when parsing them with xml-parser.
> I am working with files generated from MS databases, which
> contain characters such as "..." compressed into one character.
Does your XML document start with an XML declaration? e.g.:
<?xml version='1.0'?>
If there is a declaration, does it specify the encoding?
If there is no declaration, or there is but it does not specify
an encoding, then the encoding will default to UTF-8. If the
ellipsis is encoded as the single byte 0x85 (as it was when
I pasted one from MS Word into Notepad) then this is indeed not
a valid byte sequence for a UTF-8 encoded document (in UTF-8,
0x85 can only appear as a continuation byte inside a multi-byte
character) and XML-Spy is wrong to say that it is well-formed.
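If you want to check both points from Perl before parsing, here is
a rough sketch using only the core Encode module. The function names
declared_encoding and is_valid_utf8 are made up for this example, and
the declaration regex is a simplification, not a full XML parser.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Return the encoding named in the XML declaration, or undef if the
# document has no declaration or the declaration names no encoding.
# (Hypothetical helper; the regex only handles the common layout.)
sub declared_encoding {
    my ($bytes) = @_;
    return undef unless $bytes =~ /\A<\?xml\s([^>]*?)\?>/;
    my $decl = $1;
    return $decl =~ /\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/
        ? $2
        : undef;
}

# Return 1 if the byte string is well-formed UTF-8, 0 otherwise.
# (Hypothetical helper.)
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when a CHECK is given
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

my $doc = "<?xml version='1.0'?><note>wait\x85</note>";
my $enc = declared_encoding($doc);
print "declared encoding: ", (defined $enc ? $enc : "(none, so UTF-8)"), "\n";
print "valid UTF-8: ", (is_valid_utf8($doc) ? "yes" : "no"), "\n";
```

A document that reports no encoding and fails the UTF-8 check is in
the situation described above.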
You could try specifying that the encoding is ISO-8859-1 using
an XML declaration like this:
<?xml version='1.0' encoding='ISO-8859-1'?>
Two things to note:
1. Don't put any whitespace before the XML declaration.
2. When XML::Parser reads the file it will convert the
characters to UTF-8 for you (byte 0x85 becomes the two bytes
0xC2 0x85), so you'll still have weird characters in your
database, just different weird characters (although, if you
specify a UTF-8 char set in your web page headers, some
browsers will render them correctly).
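To see what "different weird characters" means in practice, here is a
small sketch with the core Encode module. It is not what XML::Parser
does internally, just the same byte-level conversion. The cp1252
line is my assumption about where the MS-generated byte came from:
in Windows-1252, 0x85 is the ellipsis, whereas in ISO-8859-1 it is a
control character. The helper name utf8_hex is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode raw bytes under the named encoding, re-encode as UTF-8, and
# return the resulting bytes as a hex string. (Hypothetical helper.)
sub utf8_hex {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return join ' ',
        map { sprintf('%02X', ord($_)) } split //, encode('UTF-8', $chars);
}

print utf8_hex('ISO-8859-1', "\x85"), "\n";   # C2 85    (U+0085, a control character)
print utf8_hex('cp1252',     "\x85"), "\n";   # E2 80 A6 (U+2026, horizontal ellipsis)
```

So declaring ISO-8859-1 gets the document past the parser, but the
character you end up with is U+0085 rather than a real ellipsis.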
As another reader suggested, you could slurp the XML into a string,
clean up the string with something like:
s/\x85/.../g;
and then pass the result to XML::Parser.
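Spelled out, the slurp-and-clean approach might look something like
this. The helper names slurp_bytes and clean_xml and the file name
export.xml are made up for illustration; the XML::Parser call is left
as a comment so the sketch runs without the module installed.

```perl
use strict;
use warnings;

# Read a whole file into one string of raw bytes. (Hypothetical helper.)
sub slurp_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                 # no record separator: read everything at once
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Replace each 0x85 byte with three ASCII full stops. (Hypothetical helper.)
sub clean_xml {
    my ($xml) = @_;
    $xml =~ s/\x85/.../g;
    return $xml;
}

# Usage sketch, assuming XML::Parser is installed and 'export.xml'
# is the (hypothetical) file to load:
#   use XML::Parser;
#   my $parser = XML::Parser->new(Style => 'Tree');
#   my $tree   = $parser->parse(clean_xml(slurp_bytes('export.xml')));
```

Only do this on files that are not already valid UTF-8. In a genuine
UTF-8 file, 0x85 can be part of a multi-byte character, and a blind
byte replacement would corrupt it.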
Good luck.
Grant
=====================================================================
Grant McLean | email: [EMAIL PROTECTED] | Lvl 8, 86 Lambton Quay
The Web Limited | WWW: www.web.co.nz | PO Box 15-175
Internet Solutions | Tel: (04) 495 8250 | Wellington
Awesome service | Fax: (04) 495 8259 | New Zealand
_______________________________________________
ActivePerl mailing list
[EMAIL PROTECTED]
http://listserv.ActiveState.com/mailman/listinfo/activeperl