Hi Leon

> can anyone tell me how to go about dealing with weird 
> characters in XML files when parsing them with xml-parser.
> I am working with files generated from MS databases, which
> contain characters such as "..." compressed into one character.

Does your XML document start with an XML declaration?  e.g.:

  <?xml version='1.0'?>

If there is a declaration, does it specify the encoding?

If there is no declaration, or there is one but it does not specify
an encoding, then the parser will assume UTF-8.  If the ellipsis is
stored as the single byte 0x85 (as it was when I pasted one from
MS Word into Notepad) then that is indeed not a valid byte sequence
in a UTF-8 encoded document (0x85 can only appear as a continuation
byte after a lead byte), and XML-Spy is wrong to say that the
document is well-formed.
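To check which case you are in, something along these lines might help. It is only a rough sketch: it assumes Perl 5.8 or later with the core Encode module, the sub names declared_encoding and is_valid_utf8 are made up for illustration, and it ignores byte order marks and UTF-16 entirely.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return the encoding named in the XML declaration, or 'UTF-8' if there
# is no declaration or the declaration has no encoding attribute.
# (Hypothetical helper; does not look at byte order marks.)
sub declared_encoding {
    my ($xml) = @_;
    if ($xml =~ /\A<\?xml\s+version\s*=\s*(?:'[^']*'|"[^"]*")\s+encoding\s*=\s*(?:'([^']*)'|"([^"]*)")/) {
        return defined $1 ? $1 : $2;
    }
    return 'UTF-8';
}

# Return 1 if the byte string is well-formed UTF-8, 0 otherwise.
# FB_CROAK makes decode() die on a malformed sequence; LEAVE_SRC stops
# it from modifying the input.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}
```

If declared_encoding() comes back as 'UTF-8' and is_valid_utf8() returns 0 for the raw file contents, then the file is not what it claims to be and a conforming parser should reject it.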

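You could try specifying that the encoding is ISO-8859-1 using
an XML declaration like this (the version attribute is required,
and must come first):

  <?xml version='1.0' encoding='ISO-8859-1'?>

(Strictly, 0x85 is an ellipsis only in Windows-1252; in ISO-8859-1
it is a control character, but the declaration will at least let
the parser accept the byte.)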

Two things to note:

  1. Don't put any whitespace before the XML declaration.
  2. When XML::Parser reads the file it will convert the
     characters to UTF-8 for you - so you'll still have weird
     characters in your database, just different weird characters
     (although, if you specify a UTF-8 charset in your web page
     headers, some browsers will render them correctly).
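To see what "different weird characters" means in practice for point 2, here is a small sketch of the same conversion done by hand. It assumes Perl 5.8 or later with the core Encode module (XML::Parser does its own conversion internally; this only mimics the result), and to_utf8_bytes and hex_dump are names I made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode raw bytes from the named source encoding and re-encode them as
# UTF-8, returning the resulting byte string.
sub to_utf8_bytes {
    my ($encoding, $bytes) = @_;
    return encode('UTF-8', decode($encoding, $bytes));
}

# Render a byte string as space-separated hex for inspection.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

print hex_dump(to_utf8_bytes('ISO-8859-1', "\x85")), "\n";   # C2 85
print hex_dump(to_utf8_bytes('cp1252',     "\x85")), "\n";   # E2 80 A6
```

Read as ISO-8859-1, the byte becomes the control character U+0085 (two bytes in UTF-8); read as Windows-1252 it becomes a real ellipsis, U+2026 (three bytes in UTF-8). Either way, what ends up in the database is no longer the single byte 0x85.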

As another reader suggested, you could slurp the XML into a string,
clean up the string with something like:

  s/\x85/.../sg

and then pass the result to XML::Parser.
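Spelled out, the slurp-and-clean approach might look roughly like this. It is a sketch only: slurp_and_clean is a made-up name, 'export.xml' in the usage comment is a placeholder, and it assumes the file really is in a single-byte encoding. In a genuine UTF-8 file 0x85 can legitimately occur as part of a multi-byte character, and this substitution would corrupt it.

```perl
use strict;
use warnings;

# Read a file as raw bytes and replace every 0x85 byte with three
# ASCII full stops, returning the cleaned string.
sub slurp_and_clean {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                # undef the record separator: slurp mode
    my $xml = <$fh>;
    close $fh;
    $xml =~ s/\x85/.../g;
    return $xml;
}

# Usage (needs XML::Parser installed; 'export.xml' is a placeholder):
#   require XML::Parser;
#   my $tree = XML::Parser->new(Style => 'Tree')
#                         ->parse(slurp_and_clean('export.xml'));
```

After the substitution the string is plain ASCII wherever the ellipsis was, so it parses the same whether or not the document declares an encoding.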

Good luck.

Grant

=====================================================================
Grant McLean       | email: [EMAIL PROTECTED] | Lvl 8, 86 Lambton Quay
The Web Limited    | WWW:   www.web.co.nz    | PO Box 15-175
Internet Solutions | Tel:   (04) 495 8250    | Wellington 
Awesome service    | Fax:   (04) 495 8259    | New Zealand
