>> So what if the unicode string doesn't start with an XML declaration?
>> Will it add one?
>
> No.

Ok. So the XML document would be ill-formed then unless the encoding is
UTF-8, right?

> The point of this code is not just to return whether the string starts
> with "<?xml" or not. There are actually three cases:

Still, it's overly complex for that matter:

> * The string does start with "<?xml"

  if s.startswith("<?xml"):
      return Yes

> * The string starts with a prefix of "<?xml", i.e. we can only
>   decide if it starts with "<?xml" if we have more input.

  if "<?xml".startswith(s):
      return Maybe

> * The string definitely doesn't start with "<?xml".

  return No

>> What bit fiddling are you referring to specifically that you think
>> is better done in C than in Python?
>
> The code that checks the byte signature, i.e. the first part of
> detect_xml_encoding_str().

I can't see any *bit* fiddling there, except for the bit mask of
candidates. For the candidate list, I cannot quite understand why
you need a bit mask at all, since the candidates are rarely
overlapping.

I think there could be a much simpler routine to have the same
effect.

- if it's less than 4 bytes, answer "need more data".
- otherwise, implement annex F "literally". Make a dictionary
  of all prefixes that are exactly 4 bytes, i.e.

  prefixes4 = {"\x00\x00\xFE\xFF":"utf-32be", ...
               ..., "\0\x3c\0\x3f":"utf-16be"}

  try:
      return prefixes4[s[:4]]
  except KeyError:
      pass
  if s.startswith(codecs.BOM_UTF16_BE): return "utf-16be"
  ...
  if s.startswith("<?xml"):
      return get_encoding_from_declaration(s)
  return "utf-8"

Regards,
Martin
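For reference, here is a minimal runnable sketch of the two routines outlined in the message, written for Python 3 with bytes input. It is an illustration under assumptions, not the code under review: the names starts_with_xml_decl and detect_xml_encoding are hypothetical, the body of get_encoding_from_declaration is a guess at what such a helper might do, None stands in for "Maybe" / "need more data", and the table covers only a subset of the Annex F signatures (the unusual octet orders and EBCDIC are left out).

```python
import codecs
import re

# Annex F signatures that are exactly 4 bytes long (subset only).
PREFIXES4 = {
    codecs.BOM_UTF32_BE: "utf-32be",   # 00 00 FE FF
    codecs.BOM_UTF32_LE: "utf-32le",   # FF FE 00 00
    b"\x00\x00\x00\x3c": "utf-32be",   # "<" without BOM
    b"\x3c\x00\x00\x00": "utf-32le",
    b"\x00\x3c\x00\x3f": "utf-16be",   # "<?" without BOM
    b"\x3c\x00\x3f\x00": "utf-16le",
}

# Assumed shape of the encoding pseudo-attribute in the declaration.
_ENCODING_RE = re.compile(
    rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']""")


def starts_with_xml_decl(s):
    """Return True (yes), False (no), or None (maybe, need more input)."""
    if s.startswith(b"<?xml"):
        return True
    if b"<?xml".startswith(s):
        return None
    return False


def get_encoding_from_declaration(s):
    """Hypothetical helper: read the encoding from an ASCII-compatible
    XML declaration; None means the declaration is not complete yet."""
    end = s.find(b"?>")
    if end < 0:
        return None
    match = _ENCODING_RE.search(s, 0, end)
    if match is None:
        return "utf-8"  # no encoding given, so the default applies
    return match.group(1).decode("ascii").lower()


def detect_xml_encoding(s):
    """Return an encoding name, or None if more data is needed."""
    if len(s) < 4:
        return None
    try:
        # The 4-byte table is consulted first, so the UTF-32LE BOM
        # (FF FE 00 00) is not mistaken for the UTF-16LE BOM (FF FE).
        return PREFIXES4[s[:4]]
    except KeyError:
        pass
    if s.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    if s.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"
    if s.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if s.startswith(b"<?xml"):
        return get_encoding_from_declaration(s)
    return "utf-8"
```

A caller feeding data incrementally would keep buffering while either function returns None and only commit to a decoder once a name comes back.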