lxml removing tag, keeping text order

Tim Arnold Fri, 24 Oct 2008 21:10:40 -0700

Hi,
I'm using lxml to clean up auto-generated XML so that it validates against a 
DTD. I need to remove an element's tag but keep its text in document order. 
For example:
s0 = '''
<option>
  <optional> first text
    <someelement>ladida</someelement>
    <emphasis>emphasized text</emphasis>
    middle text
    <anotherelement/>
    last text
  </optional>
</option>'''


I want to get rid of the <emphasis> tag but keep everything else as it is; 
that is, I need this result:

<option>
  <optional> first text
    <someelement>ladida</someelement>
    emphasized text
    middle text
    <anotherelement/>
    last text
  </optional>
</option>

I'm beginning to think this is an impossible task, so I'm asking here to see if 
there is some method that will work. What I've done so far is this:

(outer is the tag enclosing the parent, outside is the parent's tag, inside is 
the tag of the child to remove)
from lxml import etree
import copy
def rm_tag(elem, outer, outside, inside):
    newdiv = etree.Element(outside)
    newdiv.text = ''
    for e0 in elem.getiterator(outside):
        for i,e1 in enumerate(e0.getiterator()):
            if i == 0:
                if e1.text: newdiv.text += e1.text
            elif (e1.tag != inside):
                newdiv.append(copy.deepcopy(e1))
            elif (e1.text):
                newdiv.text += e1.text

    for t in elem.getiterator():
        if t.tag == outer:
            t.clear()
            t.append(newdiv)
            break
    return etree.ElementTree(elem)

el = etree.fromstring(s0)
print(etree.tostring(rm_tag(el,'option','optional','emphasis'),pretty_print=True))

But the text is messed up using this method: the emphasized text ends up before 
<someelement>, and the middle text is lost. I see why it's wrong, but not how 
to make it right.
It returns:
<option>
  <optional> first text
    emphasized text
    <someelement>ladida</someelement>
    <anotherelement/>
    last text
  </optional>
</option>

Maybe I should send the outside element (via tostring) to a regexp for 
removing the child and return that string? Regexp? Getting desperate, hey.
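
An alternative to the regexp route might be to work with the text/tail model 
directly. In lxml, the text after a closing tag is stored as that element's 
.tail, so removing <emphasis> means moving its .text and .tail onto whatever 
precedes it (the previous sibling's tail, or the parent's text if there is no 
previous sibling). Below is a minimal sketch of that idea, not a definitive 
implementation. The names unwrap_tag and _add_text_before are hypothetical 
helpers made up for illustration, the syntax is Python 3, and the fallback 
import of the standard library's ElementTree is an assumption so the sketch 
also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is close enough for this sketch.
    import xml.etree.ElementTree as etree


def _add_text_before(parent, index, text):
    """Append text to whatever precedes position `index` inside `parent`."""
    if not text:
        return
    if index == 0:
        # Nothing precedes the slot, so the text belongs to parent.text.
        parent.text = (parent.text or "") + text
    else:
        # Otherwise it belongs to the tail of the preceding sibling.
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap_tag(parent, tag):
    """Drop every <tag> element below `parent`, keeping its text,
    children and tail in their original document position."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != tag:
            unwrap_tag(child, tag)  # recurse into elements we keep
            i += 1
            continue
        # Save the pieces first: removing the element also drops its tail.
        text, tail = child.text, child.tail
        grandchildren = list(child)
        parent.remove(child)
        _add_text_before(parent, i, text)
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(i + offset, grandchild)
        _add_text_before(parent, i + len(grandchildren), tail)
        # i is not advanced: slot i now holds a spliced-in child
        # or the next sibling, and both still need checking.


s0 = '''
<option>
  <optional> first text
    <someelement>ladida</someelement>
    <emphasis>emphasized text</emphasis>
    middle text
    <anotherelement/>
    last text
  </optional>
</option>'''

root = etree.fromstring(s0)
unwrap_tag(root, 'emphasis')
print(etree.tostring(root, encoding='unicode'))
```

The tree is edited in place, so nothing has to be copied or rebuilt and the 
order of text and elements is never disturbed. Recent lxml releases also 
provide etree.strip_tags(root, 'emphasis'), which is documented to merge an 
element's text and children into its parent in the same way.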

Any pointers much appreciated,
--Tim Arnold


--
http://mail.python.org/mailman/listinfo/python-list
