There seems to be quite a bit of confusion when it comes to Python and encodings. The following PEP (PEP 100, Python Unicode Integration) discusses Python and Unicode and gives some insight.
http://www.python.org/dev/peps/pep-0100/

With py3k this confusion should be much reduced, since it unifies the str and unicode types into a single Unicode text type, str, and uses a separate type, "bytes", for any encoded (binary) data.
http://docs.python.org/dev/3.0/whatsnew/3.0.html

--Anand

On Thu, Mar 20, 2008 at 10:57 AM, Gurpreet Sachdeva <[EMAIL PROTECTED]> wrote:
> Thanks Anand for your help. Forwarding your post to the group.
>
> Regards,
> Gurpreet Singh
>
> ---------- Forwarded message ----------
> From: Anand Balachandran Pillai <[EMAIL PROTECTED]>
> Date: Wed, Mar 19, 2008 at 11:48 PM
> Subject: Re: [BangPypers] Handling unicode characters in xml.dom
> To: Gurpreet Sachdeva <[EMAIL PROTECTED]>
>
> Hi Gurpreet,
>
> The problem is that you have some junk characters in the file
> (mostly Japanese unicode, since the original file seems to be
> Japanese), which show up as Ctrl characters when read as ascii.
> When the parser tries to parse the file it interprets the first
> Ctrl character (^S) as a newline, so it thinks there is an extra
> break in the text and produces a "not well-formed token" error.
>
> The way to solve this is to decode the file and encode it again in
> an encoding other than ascii. I tried iso-8859-1 decoding and
> unicode-escape encoding and it works. For this you need the codecs
> module, since a default file object in Python 2 falls back to the
> ascii codec when it is given unicode text.
>
> Here is the full code...
> ---------------------------------------------
> import codecs
> import xml.dom.minidom as mdom
>
> # Read the raw bytes of the problem file
> data = open('problem.xml').read()
> f = open('problem2.xml', 'w')
>
> # Decode what we write as iso-8859-1, store it as unicode-escape
> e = codecs.EncodedFile(f, 'iso-8859-1', 'unicode-escape')
> e.write(data)
> e.close()
>
> # unicode-escape turned line endings into literal "\r\n" text;
> # put real newlines back
> data = open('problem2.xml').read()
> data = '\n'.join(data.split("\\r\\n"))
> open('problem2.xml', 'w').write(data)
>
> print mdom.parse('problem2.xml')
> --------------------------------------------------
>
> The unicode-escape encoding converts the non-ascii characters to
> their hex escapes, but it also escapes line endings into the literal
> text "\r\n". So we turn these back into real newlines by splitting
> the data on that text and joining it with "\n".
>
> The modified file is saved in problem2.xml.
>
> Btw, can you forward this to the list? I am on a slow connection, so
> I am using the html interface to gmail, and address completion is
> missing there.
>
> HTH,
>
> --Anand
>
> On 3/19/08, Gurpreet Sachdeva <[EMAIL PROTECTED]> wrote:
> > Hi Anand,
> >
> > Please find attached the xml file that contains the garbage
> > characters. Is there a way we can handle them?
> >
> > Thanks for your help.
> > Gurpreet
> >
> > On Tue, Mar 18, 2008 at 1:22 PM, Anand Balachandran Pillai
> > <[EMAIL PROTECTED]> wrote:
> > > Is the garbage in character data (CDATA) or attribute data?
> > >
> > > Character data is like <elem>text</elem> and attribute data
> > > is like <elem attr="value" />
> > >
> > > Can you paste the relevant part of the XML file here or, if it
> > > is small enough, the complete XML file? Send it directly to me
> > > since the list removes attachments.
> > >
> > > --Anand
> > >
> > > On Tue, Mar 18, 2008 at 11:05 AM, Gurpreet Sachdeva
> > > <[EMAIL PROTECTED]> wrote:
> > > > <?xml version="1.0" encoding="UTF-8"?>
> > > >
> > > > Still the problem exists.
> > > >
> > > > - Gurpreet
> > > >
> > > > On Tue, Mar 18, 2008 at 10:44 AM, Anand Balachandran Pillai
> > > > <[EMAIL PROTECTED]> wrote:
> > > > > What is the encoding of your XML file ?
> > > > > i.e. in the string
> > > > > "<?xml version="1.0" encoding="<encoding>"?>",
> > > > > what is <encoding>?
> > > > >
> > > > > Make sure it is an encoding like utf-8 or iso-8859-1
> > > > > which can help the parser to understand garbage
> > > > > chars.
> > > > >
> > > > > --Anand
> > > > >
> > > > > On Tue, Mar 18, 2008 at 10:38 AM, Gurpreet Sachdeva
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Any idea how to handle the unicode characters existing in
> > > > > > an xml file while parsing it?
> > > > > >
> > > > > > This is what I am doing:
> > > > > >
> > > > > > from xml.dom import minidom
> > > > > >
> > > > > > xmlObj = minidom.parse(fileobj)
> > > > > >
> > > > > > And the script throws an error because of some special
> > > > > > characters ['f (3gpÕ¡¤ë'] present in the xml file. Any
> > > > > > suggestions/pointers would be appreciated.
> > > > > >
> > > > > > Thanks and Regards,
> > > > > > Gurpreet Singh
> > > > > > _______________________________________________
> > > > > > BangPypers mailing list
> > > > > > BangPypers@python.org
> > > > > > http://mail.python.org/mailman/listinfo/bangpypers
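
For readers on Python 3, the str/bytes split mentioned at the top of this message makes the clean-up more direct. The sketch below is not the code from the thread: it swaps in a different technique, and it assumes that the offending bytes are control characters which XML 1.0 does not allow (such as the ^S mentioned above) and that the file declares UTF-8, as Gurpreet's does. The helper name parse_xml_bytes and the sample data are made up for illustration.

```python
import re
from xml.dom import minidom

# Control characters that XML 1.0 forbids (everything below 0x20
# except tab, line feed and carriage return).
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_xml_bytes(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes, drop forbidden chars, parse."""
    text = raw.decode(encoding)        # bytes -> str (Unicode text)
    text = _FORBIDDEN.sub("", text)    # remove characters the parser rejects
    # Re-encode as UTF-8 so the bytes agree with the UTF-8 declaration
    # assumed for the input file.
    return minidom.parseString(text.encode("utf-8"))


# Made-up sample: a stray ^S (0x13) followed by UTF-8 encoded text.
raw = b'<?xml version="1.0" encoding="UTF-8"?><a>f\x13 (3gp\xc3\x95</a>'
doc = parse_xml_bytes(raw)
print(ascii(doc.documentElement.firstChild.data))
```

The code in the original reply is written for Python 2. On Python 3 one would open the file in binary mode ('rb') and hand the bytes to a helper like this one, picking the encoding argument to match how the file was really written.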