I'm trying to align an XML file with the original text file from which it was created. Unfortunately, the XML version of the file has added and removed some of the whitespace. For example::

    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = '''
    ... <s>Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
    ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
    ... <EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
    ... </s>
    ... '''

I want to determine which offsets in the *original* text each element
from the XML text is supposed to cover. So I want something like::

    >>> xml_tree = etree.fromstring(xml_text)
    >>> align(xml_tree, plain_text)
    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text
along with their start and end indices in the original text::

    >>> plain_text[31:35]
    'said'
    >>> plain_text[49:57]
    'approved'
    >>> plain_text[62:73]
    'acquisition'

Note that I want to ignore whitespace as much as possible, so the
elements are aligned only to the non-whitespace text they include.

Below is my current implementation of the ``align`` function. It seems
pretty messy to me -- can anyone offer me some advice on how to clean it
up or write it differently?

    def align(tree, text):

        def align_helper(elem, elem_start):

            # skip whitespace in the text before the element
            while text[elem_start:elem_start + 1].isspace():
                elem_start += 1

            # advance the element end past any element text
            elem_end = elem_start
            if elem.text is not None:
                for char in elem.text:
                    if not char.isspace():
                        while text[elem_end:elem_end + 1].isspace():
                            elem_end += 1
                        assert text[elem_end] == char
                        elem_end += 1

            # advance the element end past any child elements
            for child_elem in elem:
                elem_end = align_helper(child_elem, elem_end)

            # advance the start for the next element past the tail text
            next_start = elem_end
            if elem.tail is not None:
                for char in elem.tail:
                    if not char.isspace():
                        while text[next_start:next_start + 1].isspace():
                            next_start += 1
                        assert text[next_start] == char
                        next_start += 1

            # add the element and its start and end to the result list
            result.append((elem, elem_start, elem_end))

            # return the start of the next element
            return next_start

        result = []
        align_helper(tree, 0)
        return result

Thanks,

STeVe
-- 
http://mail.python.org/mailman/listinfo/python-list
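
One possible direction for a cleanup, offered as a rough sketch rather than a drop-in replacement: record the offset of every non-whitespace character in the plain text once, then walk the tree while counting non-whitespace characters, so the two duplicated skip-and-match loops collapse into one helper. The names ``align_by_offsets``, ``consume`` and ``helper`` are hypothetical, and the sketch assumes the XML and the plain text contain the same non-whitespace characters in the same order.

```python
import xml.etree.ElementTree as etree


def align_by_offsets(tree, text):
    """Hypothetical variant of align(): pair each element with text offsets."""
    # Offset in the plain text of every non-whitespace character.
    offsets = [i for i, char in enumerate(text) if not char.isspace()]
    result = []

    def consume(chunk, seen):
        # Count the non-whitespace characters in an element's text or tail,
        # checking each one against the plain text as the original code does.
        for char in chunk or '':
            if not char.isspace():
                assert text[offsets[seen]] == char
                seen += 1
        return seen

    def helper(elem, seen):
        # `seen` is how many non-whitespace characters precede this element.
        first = seen
        seen = consume(elem.text, seen)
        for child in elem:
            seen = helper(child, seen)
        if seen > first:
            # Span runs from the first to just past the last covered character.
            result.append((elem, offsets[first], offsets[seen - 1] + 1))
        else:
            # Empty element: zero-width span at the next non-whitespace offset.
            pos = offsets[first] if first < len(offsets) else len(text)
            result.append((elem, pos, pos))
        return consume(elem.tail, seen)

    helper(tree, 0)
    return result


if __name__ == '__main__':
    # Tiny made-up input, just to show the call shape.
    demo_text = 'a  b c'
    demo_tree = etree.fromstring('<s>a <e> b </e>c</s>')
    for elem, start, end in align_by_offsets(demo_tree, demo_text):
        print(elem.tag, start, end, repr(demo_text[start:end]))
```

Like the original, this appends children before their parents, so the result list stays in the same order. Whether the extra ``offsets`` list is worth its memory for very large files is a judgment call.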