"Martin v. Löwis" sagte: >>> So what if the unicode string doesn't start with an XML declaration? >>> Will it add one? >> >> No. > > Ok. So the XML document would be ill-formed then unless the encoding is > UTF-8, right?
I don't know. Is an XML document ill-formed if it doesn't contain an XML
declaration, is not in UTF-8 or UTF-16, but there's external encoding
info? If it is, then yes, the document would be ill-formed.

>> The point of this code is not just to return whether the string starts
>> with "<?xml" or not. There are actually three cases:
>
> Still, it's overly complex for that matter:
>
>>  * The string does start with "<?xml"
>
>   if s.startswith("<?xml"):
>     return Yes
>
>>  * The string starts with a prefix of "<?xml", i.e. we can only
>>    decide if it starts with "<?xml" if we have more input.
>
>   if "<?xml".startswith(s):
>     return Maybe
>
>>  * The string definitely doesn't start with "<?xml".
>
>   return No

This looks good. Now we would have to extend the code to detect and
replace the encoding in the XML declaration too.

>>> What bit fiddling are you referring to specifically that you think
>>> is better done in C than in Python?
>>
>> The code that checks the byte signature, i.e. the first part of
>> detect_xml_encoding_str().
>
> I can't see any *bit* fiddling there, except for the bit mask of
> candidates. For the candidate list, I cannot quite understand why
> you need a bit mask at all, since the candidates are rarely
> overlapping.

I tried many variants and that seemed to be the most straightforward one.

> I think there could be a much simpler routine to have the same
> effect.
> - if it's less than 4 bytes, answer "need more data".

Can there be an XML document that is less than 4 bytes? I guess not.

> - otherwise, implement annex F "literally". Make a dictionary
>   of all prefixes that are exactly 4 bytes, i.e.
>
>   prefixes4 = {"\x00\x00\xFE\xFF":"utf-32be", ...
>                ..., "\0\x3c\0\x3f":"utf-16be"}
>
>   try: return prefixes4[s[:4]]
>   except KeyError: pass
>   if s.startswith(codecs.BOM_UTF16_BE):return "utf-16be"
>   ...
>   if s.startswith("<?xml"):
>     return get_encoding_from_declaration(s)
>   return "utf-8"

get_encoding_from_declaration() would have to do the same yes/no/maybe
decision. But anyway: would a Python implementation of these two
functions (detect_encoding()/fix_encoding()) be accepted?

Servus,
   Walter
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
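
For reference, a minimal Python sketch of the two pieces discussed in this
message: the yes/no/maybe prefix check and the 4-byte signature lookup.
This is an illustration under assumptions, not the code under review. The
names YES/NO/MAYBE, starts_with_decl() and PREFIXES4 are hypothetical, the
table holds only a subset of the signatures listed in the autodetection
appendix of the XML specification, and parsing the encoding out of the
declaration itself is left out (it would need the same three-way decision).

```python
import codecs

YES, NO, MAYBE = "yes", "no", "maybe"   # hypothetical result markers
XML_DECL = b"<?xml"


def starts_with_decl(data, marker=XML_DECL):
    """Three-way check whether data starts with marker."""
    if data.startswith(marker):
        return YES
    if marker.startswith(data):
        # data is a (possibly empty) prefix of marker: need more input
        return MAYBE
    return NO


# 4-byte signatures; only a subset of the cases, for illustration.
PREFIXES4 = {
    codecs.BOM_UTF32_BE: "utf-32-be",   # 00 00 FE FF
    codecs.BOM_UTF32_LE: "utf-32-le",   # FF FE 00 00
    b"\x00\x00\x00<": "utf-32-be",      # "<" without BOM
    b"<\x00\x00\x00": "utf-32-le",
    b"\x00<\x00?": "utf-16-be",         # "<?" without BOM
    b"<\x00?\x00": "utf-16-le",
}


def detect_encoding(data):
    """Return an encoding name, or None if more input is needed."""
    if len(data) < 4:
        return None
    try:
        return PREFIXES4[data[:4]]
    except KeyError:
        pass
    # The UTF-32 BOMs were handled above, so the shorter UTF-16 BOMs
    # cannot be confused with them here.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Assumption: declaration parsing is omitted, so fall back to UTF-8.
    return "utf-8"
```

The dictionary lookup runs before the UTF-16 BOM tests because the UTF-32
little-endian BOM begins with the same two bytes as the UTF-16
little-endian BOM.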