Hi,

I'm using lxml to clean up auto-generated XML so that it validates against a DTD. I need to remove an element tag but keep its text in order. For example:

    s0 = '''
    <option>
      <optional> first text
        <someelement>ladida</someelement>
        <emphasis>emphasized text</emphasis>
        middle text
        <anotherelement/>
        last text
      </optional>
    </option>'''

I want to get rid of the <emphasis> tag but keep everything else as it is; that is, I need this result:

    <option>
      <optional> first text
        <someelement>ladida</someelement>
        emphasized text
        middle text
        <anotherelement/>
        last text
      </optional>
    </option>

I'm beginning to think this is an impossible task, so I'm asking here to see if there is some method that will work. What I've done so far is this (outer encloses the parent, outside is the parent, inside is the child to remove):

    from lxml import etree
    import copy

    def rm_tag(elem, outer, outside, inside):
        newdiv = etree.Element(outside)
        newdiv.text = ''
        for e0 in elem.getiterator(outside):
            for i,e1 in enumerate(e0.getiterator()):
                if i == 0:
                    if e1.text:
                        newdiv.text += e1.text
                elif (e1.tag != inside):
                    newdiv.append(copy.deepcopy(e1))
                elif (e1.text):
                    newdiv.text += e1.text

        for t in elem.getiterator():
            if t.tag == outer:
                t.clear()
                t.append(newdiv)
                break
        return etree.ElementTree(elem)

    print etree.tostring(rm_tag(el,'option','optional','emphasis'),pretty_print=True)

But the text is messed up using this method. I see why it's wrong, but not how to make it right. It returns:

    <option>
      <optional> first text
        emphasized text
        <someelement>ladida</someelement>
        <anotherelement/>
        last text
      </optional>
    </option>

Maybe I should send the outside element (via tostring) to a regexp that removes the child, and return that string? Regexp? Getting desperate, hey.

Any pointers much appreciated,
--Tim Arnold
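
For what it's worth, here is a minimal sketch of one way the text order could be preserved. It relies on the ElementTree model, where the text that follows an element is stored in that element's .tail, so the removed element's text and tail have to be attached to the previous sibling's tail (or to the parent's text when there is no previous sibling). The helper name strip_tag is made up for illustration, and the sketch uses the standard library's xml.etree.ElementTree rather than lxml so that it runs on its own; the same attribute names exist on lxml elements. As far as I know, lxml also offers etree.strip_tags(tree, 'emphasis') for this job, which is worth checking against your installed version.

```python
import xml.etree.ElementTree as ET


def strip_tag(root, tag):
    """Remove every <tag> element below root, keeping its text, children and tail in place.

    Hypothetical helper for illustration; not part of lxml or the standard library.
    """
    # Collect (parent, child) pairs up front so the tree is not modified while iterating.
    pairs = [(parent, child)
             for parent in root.iter()
             for child in list(parent)
             if child.tag == tag]

    # Work backwards so that nested matches are spliced out before their ancestors.
    for parent, child in reversed(pairs):
        idx = list(parent).index(child)
        kids = list(child)
        lead = child.text or ''

        # The removed element's tail follows its last child, or its own text if it has none.
        if kids:
            kids[-1].tail = (kids[-1].tail or '') + (child.tail or '')
        else:
            lead += child.tail or ''

        # Leading text goes onto the previous sibling's tail, or onto the parent's text.
        if idx == 0:
            parent.text = (parent.text or '') + lead
        else:
            prev = parent[idx - 1]
            prev.tail = (prev.tail or '') + lead

        parent.remove(child)
        for offset, kid in enumerate(kids):
            parent.insert(idx + offset, kid)
    return root


doc = ET.fromstring(
    '<option><optional> first text <someelement>ladida</someelement> '
    '<emphasis>emphasized text</emphasis> middle text <anotherelement/> '
    'last text </optional></option>'
)
strip_tag(doc, 'emphasis')
print(ET.tostring(doc, encoding='unicode'))
```

The key difference from the rm_tag attempt is that nothing is appended to the parent's .text after the first child: text that comes later in the document is kept on the .tail of whichever element precedes it.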