OK, I'm still not getting this Unicode business.

Given this document:
==========================
<?xml version="1.0" encoding="utf-8" ?>

<document>
    <a>a&#224;&#225;&#226;&#227;</a>
    <e>e&#232;&#233;&#234;&#235;</e>
    <i>i&#236;&#237;&#238;&#239;</i>
    <o>o&#242;&#243;&#244;&#245;</o>
    <u>o&#249;&#250;&#251;&#252;</u>
</document>
==========================
(If testing, make sure you save this file encoded as UTF-8.)

and this Python script:
==========================
import sys
from xml.dom.minidom import *
from xml.dom import *
import codecs
import string

CHARACTERS = range(128,255)

def unicode2charrefs(s):
    """Returns a unicode string with all the non-ascii characters from the
    given unicode string converted to character references."""
    result = u""
    for c in s:
        code = ord(c)
        if code in CHARACTERS:
            result += (u"&#" + string.zfill(str(code), 3).decode('utf-8')
                       + u";")
        else:
            result += c.encode('utf-8')
    return result

def main():
    print "Parsing file..."
    file = codecs.open(sys.argv[1], "r", "utf-8")
    document = parse(file)
    file.close()
    print "done."

    print document.toxml(encoding="utf-8")
    out_str = unicode2charrefs(document.toxml(encoding="utf-8"))

    print "Writing to '" + sys.argv[1] + "~' ..."
    file = codecs.open(sys.argv[1] + "~", "w", "utf-8")
    file.write(out_str)
    file.close()
    print "done."

if __name__ == "__main__": main()
==========================

Does anyone else get this output from the "print
document.toxml(encoding="utf-8")" line:
<document>
    <a>aàáâã</a>
    <e>eèéêë</e>
    <i>iìíîï</i>
    <o>oòóôõ</o>
    <u>oùúûü</u>
</document>

and, similarly, this output document:
==========================
<?xml version="1.0" encoding="utf-8"?>
<document>
    <a>a&#195;&#160;&#195;&#161;&#195;&#162;&#195;&#163;</a>
    <e>e&#195;&#168;&#195;&#169;&#195;&#170;&#195;&#171;</e>
    <i>i&#195;&#172;&#195;&#173;&#195;&#174;&#195;&#175;</i>
    <o>o&#195;&#178;&#195;&#179;&#195;&#180;&#195;&#181;</o>
    <u>o&#195;&#185;&#195;&#186;&#195;&#187;&#195;&#188;</u>
</document>
==========================

i.e., does anyone else get two-byte sequences beginning with
capital-A-with-tilde instead of the expected characters?
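
To make the symptom concrete, here is a minimal sketch, assuming Python 3
(the script above is Python 2). It encodes text to UTF-8 first and then emits
one character reference per byte, which reproduces the 195/160 pairs seen in
the output document above. The name byte_charrefs is a hypothetical helper of
mine, not part of any library.

```python
def byte_charrefs(text):
    """Hypothetical helper: one character reference per UTF-8 *byte*."""
    data = text.encode("utf-8")  # U+00E0 becomes the two bytes 0xC3 0xA0
    # Iterating over bytes in Python 3 yields integers in the range 0..255.
    return "".join("&#%d;" % b if b >= 128 else chr(b) for b in data)

print(byte_charrefs("a\u00e0"))  # prints a&#195;&#160;
```

Byte 195 (0xC3) is what shows up as capital-A-with-tilde when it is read as
Latin-1.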

I'm using KDE's Kate editor and the Konsole terminal (running bash) on
Linux (2.6 kernel). Does that make any difference?
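
In case the terminal matters, this is a quick way to see which encodings
Python believes are in play. It is only a minimal sketch, assuming Python 3;
describe_encodings is a hypothetical name, and the values it reports depend
entirely on the local environment, so I give no expected output.

```python
import locale
import sys

def describe_encodings():
    """Hypothetical helper: collect the encodings Python reports for this session."""
    return {
        # Encoding used when print() writes to the terminal (may be None).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding the locale settings suggest for text files.
        "preferred": locale.getpreferredencoding(False),
        # Python's default str <-> bytes encoding.
        "default": sys.getdefaultencoding(),
    }

for name, value in sorted(describe_encodings().items()):
    print(name, value)
```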

I've just tried it in a Unicode-aware xterm. There the "print
document.toxml(encoding="utf-8")" line produces the expected output, but
the output file is still wrong.
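
For comparison, this is a minimal sketch of the result I expect, assuming
Python 3 and input that is already decoded text rather than UTF-8 bytes. It
relies on the standard "xmlcharrefreplace" error handler; text_charrefs is a
hypothetical helper name. Unlike my script it does not zero-pad the numbers.

```python
def text_charrefs(text):
    """Hypothetical helper: one decimal character reference per non-ASCII code point."""
    # Encoding to ASCII with this handler swaps each unencodable character
    # for &#NNN;, where NNN is its code point in decimal.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(text_charrefs("<a>a\u00e0\u00e1\u00e2\u00e3</a>"))
# prints <a>a&#224;&#225;&#226;&#227;</a>
```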

Any ideas what's wrong?

Cheers,
Richard