[EMAIL PROTECTED] wrote:
> Is there a Pythonic way to read the file and identify any illegal XML
> characters so I can strip them out? This would keep my program more
> flexible - if the vendor is going to allow one illegal character in
> their document, there's no way of knowing if another one will pop up
> later.

Notice that you are talking about bytes here, not characters. It is
inherently difficult to determine invalid bytes: you first have to
determine the encoding, then (mentally) decode, and then find out
whether there are any invalid characters.

The valid XML characters are defined in

http://www.w3.org/TR/2006/REC-xml-20060816/#charsets

So the invalid characters are #x0 .. #x8, #xB, #xC, #xE .. #x1F,
#xD800 .. #xDFFF, #xFFFE, and #xFFFF.

If you restrict attention to only the invalid characters below #x20
(i.e. control characters), and also restrict attention to encodings
that are strict ASCII supersets (ASCII, ISO-8859-x, UTF-8), you can
filter out the invalid characters on the byte level. Otherwise, you
have to decode, filter out on the character level, and then encode
again. Neither approach will deal with bytes that are invalid with
respect to the encoding.

To filter out the control bytes on the byte level, I recommend using
str.translate. Make an identity table for the substitution, and pass
the bytes you want deleted as the deletechars argument.

Regards,
Martin
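
P.S. As a rough sketch only, the two approaches might look like this.
It assumes Python 3, where bytes.translate takes the place of
str.translate and passing None as the table stands for the identity
table. The names strip_illegal_bytes, strip_illegal_chars,
ILLEGAL_CONTROL_BYTES and ILLEGAL_XML_CHARS are made up for
illustration, and the caller is assumed to know the encoding already.

```python
import re

# Control characters below #x20 that XML 1.0 forbids: everything except
# tab (#x9), line feed (#xA) and carriage return (#xD).
ILLEGAL_CONTROL_BYTES = bytes(
    b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D)
)


def strip_illegal_bytes(data):
    """Byte-level filter (hypothetical helper).

    Only safe for encodings that are strict ASCII supersets,
    such as ASCII, ISO-8859-x or UTF-8.
    """
    # None as the table means "identity mapping"; the second argument
    # lists the bytes to delete.
    return data.translate(None, ILLEGAL_CONTROL_BYTES)


# Character-level pattern covering the full list of invalid characters,
# including the surrogate range and #xFFFE / #xFFFF.
ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_chars(data, encoding):
    """Character-level filter (hypothetical helper): decode, filter, encode."""
    # decode() raises UnicodeDecodeError on bytes that are invalid in
    # the given encoding; this sketch does not try to repair those.
    text = data.decode(encoding)
    return ILLEGAL_XML_CHARS.sub("", text).encode(encoding)
```

For example, strip_illegal_bytes(raw) could be applied to the raw file
contents before handing them to the XML parser, while
strip_illegal_chars(raw, "utf-16") would be the fallback for an
encoding that is not an ASCII superset.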