On Dec 26, 7:30 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> André wrote:
> > In trying to parse html files using ElementTree running under Python
> > 3.0a1, and using htmlentitydefs.py to add "character entities" to the
> > parser, I found that I needed to create a customized version of
> > htmlentitydefs.py to make things work properly.
>
> Can you please state what precise problem you were seeing? The original
> code looks fine to me as it stands.
>

As stated above, I was using ElementTree to parse an html file and
sending the output to a browser. Without an additional parser, I was
getting the following error message:

Traceback (most recent call last):
  File "/Users/andre/CrunchySVN/branches/andre/src/http_serve.py", line 79, in do_POST
    self.server.get_handler(realpath)(self)
  File "src/plugins/handle_default.py", line 53, in handler
    data = path_to_filedata(request.path, root_path)
  File "src/plugins/handle_default.py", line 39, in path_to_filedata
    return cp.create_vlam_page(open(npath), path).read()
  File "/Users/andre/CrunchySVN/branches/andre/src/CrunchyPlugin.py", line 98, in create_vlam_page
    return vlam.CrunchyPage(filehandle, url, remote=remote, local=local)
  File "/Users/andre/CrunchySVN/branches/andre/src/vlam.py", line 62, in __init__
    self.tree = parse(filehandle)#XmlFile(filehandle)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 823, in parse
    tree.parse(source, parser)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 561, in parse
    parser.feed(data)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1201, in feed
    self._parser.Parse(data, 0)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1157, in _default
    self._parser.ErrorColumnNumber)
xml.parsers.expat.ExpatError: undefined entity &eacute;: line 401, column 11

So, I tried to specify an additional parser via

if python_version >= 3:
    import htmlentitydefs

    class XmlFile(ElementTree.ElementTree):
        def __init__(self, file=None):
            ElementTree.ElementTree.__init__(self)
            parser = ElementTree.XMLTreeBuilder(
                target=ElementTree.TreeBuilder(ElementTree.Element))
            ent = htmlentitydefs.entitydefs
            for entity in ent:
                if entity not in parser.entity:
                    parser.entity[entity] = ent[entity]
            self.parse(source=file, parser=parser)
            return

The output was "wrong". For example, one of the tests I used was to
process a copy of the main dict of htmlentitydefs.py inside an html page.
A few of the characters came out OK, but I got things like:

'Α': 0x0391, # greek capital letter alpha, U+0391

When using my modified version, I got the following (which may not be
transmitted properly by email...)

'Α': 0x0391, # greek capital letter alpha, U+0391

It does look like a Greek capital letter alpha here.

> > It does work for me ... but I don't know enough about unicode to be
> > sure that it is a proper bug, and not a quirk due to the way I wrote
> > my app.
>
> Without knowing what the actual problem is, it is hard to tell.

I hope the above is of some help.

Regards,
André

> Regards,
> Martin
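
For comparison, here is a minimal sketch of the same entity-registration idea written against the current standard library rather than 3.0a1. It assumes html.entities.name2codepoint (the Python 3 home of htmlentitydefs) and maps each entity name to the character itself via chr(). The helper name make_html_entity_parser and the sample document are made up for illustration. The sample carries a DOCTYPE because, without one, expat rejects an unknown entity before the parser's entity table is consulted.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def make_html_entity_parser():
    """Return an XMLParser whose entity table knows the HTML named entities.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    parser = ET.XMLParser()
    for name, codepoint in name2codepoint.items():
        # Map the entity name to the actual character, not to a
        # "&#NNN;" reference string, and keep any existing entries.
        parser.entity.setdefault(name, chr(codepoint))
    return parser


# Illustrative input only. The DOCTYPE matters: with an external DTD
# declared, undefined entities are passed on to the entity table.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html><body><p>Andr&eacute; &Alpha;</p></body></html>'
)

root = ET.fromstring(SAMPLE, parser=make_html_entity_parser())
text = root.find("body/p").text
print(ascii(text))
```

With the table filled this way, both &eacute; and &Alpha; should come through as single characters in the element text, which is the behaviour I was after with my modified htmlentitydefs.py.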