Hi list,

First of all, I wish you all a happy 2006. I have a small question that googling didn't turn up an answer for. So hopefully you'll be kind enough to send me in the right direction.

I'm developing a desktop application, called Task Coach, that saves its domain objects (tasks, mostly :-) in an XML file. Users have reported that sometimes their Task Coach file would become unreadable by Task Coach after copying information from some other application into e.g. a task description. Looking at the 'corrupted' file showed that control characters ended up in the XML file (Control-K for example).

Task Coach uses xml.dom to create an XML document and save it, like this:

    class XMLWriter:
        ...
        def write(self, taskList):
            domImplementation = xml.dom.getDOMImplementation()
            self.document = domImplementation.createDocument(None, 'tasks', None)
            ...
            for task in taskList.rootTasks():
                self.document.documentElement.appendChild(self.taskNode(task))
            self.document.writexml(self.__fd)  # __fd is a file open for writing
            ...

Apparently, the writexml method of xml.dom (which comes from xml.dom.minidom if PyXML is not installed, I think) does not feel that writing control characters in an XML file is wrong, but the parser does:

    Traceback (most recent call last):
      ...
      File "c:\Program Files\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
        parser.Parse(buffer, 0)
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 77, column 147

Rightfully so, because ^K is not a valid character in XML 1.0, according to http://www.w3.org/TR/REC-xml/:

    "Legal characters are tab, carriage return, line feed, and the legal
    characters of Unicode and ISO/IEC 10646. [...] Consequently, XML
    processors MUST accept any character in the range specified for Char.

    Character Range
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"

So, all this leads me to the following questions:

- Why does the writexml method of the document created by the object returned by getDOMImplementation() allow control characters? Isn't that a bug?
- What is the easiest/most pythonic (preferably built-in) way of checking a Unicode string for control characters and weeding those characters out?

Thanks,
Frank
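
For the second question, one possible approach is sketched below under some assumptions: the helper name strip_illegal_xml_chars is hypothetical, the code targets Python 3 (where str is a Unicode string) rather than the Python 2.4 shown in the traceback, and it simply drops every character that falls outside the Char production quoted from the XML 1.0 spec, using only the standard re module.

```python
import re

# Matches any single character that is NOT allowed by the XML 1.0 Char
# production quoted above:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    '[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_illegal_xml_chars(text):
    """Return text with all characters removed that XML 1.0 does not allow.

    Hypothetical helper, not part of xml.dom or Task Coach.
    """
    return _ILLEGAL_XML_CHARS.sub('', text)


def has_illegal_xml_chars(text):
    """Return True if text contains at least one character illegal in XML 1.0."""
    return _ILLEGAL_XML_CHARS.search(text) is not None
```

Such a filter could be applied to user-entered text (a task description, say) before the corresponding text node is created, so that writexml never receives the offending characters. Silently dropping characters is a design choice; replacing them with a space or warning the user would work the same way.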