Martin Mueller wrote at 2023-6-8 04:02 +0000:
>I use lxml to work with a large collection of TEI-encoded texts (66,000) that 
>are linguistically annotated.  Each token is wrapped in a <w> or <pc> element 
>with a unique ID and various attributes. I can march through the texts at the 
>lowest level of <w> and <pc> elements without paying any attention to the 
>discursive structure of higher elements. I just do
>
>            for w in tree.iter(tei + 'w', tei + 'pc'):
>                if x:
>                    do this
>                if y:
>                    do that
>
>But now I want to create a concordance in which tokens meeting some condition 
>are pulled out and surrounded with seven words on either side.  I do this with 
>itersiblings(), but that is a tricky operation. The next <w> token may not be 
>a sibling but a child of a higher level sibling.  Remembering that "elements 
>are lists", you have patterns like
>
>            [a, b, c, [d, e, f], g, h, i, [k, l, m, n]]

What matters, apparently, is the sequence of `w` and `pc` elements
in document order, and your `tree.iter(...)` loop already produces
exactly that sequence, regardless of how deeply the tokens are nested.
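
To make that concrete, here is a minimal sketch of a concordance built
from the flat token sequence rather than from `itersiblings()`. It is
written against the standard library's `xml.etree.ElementTree` so that
it runs without lxml; with lxml you can call `iter()` with both tags as
in your loop. The namespace string, the `concordance` helper, the
`matches` callback and the sample document are my assumptions for
illustration, and I count `pc` elements as context tokens alongside `w`.

```python
import xml.etree.ElementTree as ET

# Assumed value of the `tei` prefix used in the original loop.
TEI = "{http://www.tei-c.org/ns/1.0}"
TOKEN_TAGS = {TEI + "w", TEI + "pc"}


def concordance(root, matches, width=7):
    """Yield (left, hit, right) for every token where matches(token) is true.

    `left` and `right` are lists of up to `width` neighbouring tokens.
    """
    # iter() walks the whole tree in document order, so the nesting of
    # the tokens inside higher-level elements no longer matters.
    tokens = [el for el in root.iter() if el.tag in TOKEN_TAGS]
    for i, tok in enumerate(tokens):
        if matches(tok):
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            yield left, tok, right


# Hypothetical sample: the tokens "c" and "d" sit one level deeper.
sample = ET.fromstring(
    '<text xmlns="http://www.tei-c.org/ns/1.0"><p>'
    '<w>a</w><w>b</w><hi><w>c</w><w>d</w></hi><w>e</w><pc>.</pc>'
    '</p></text>'
)

rows = [
    ([t.text for t in left], hit.text, [t.text for t in right])
    for left, hit, right in concordance(sample, lambda t: t.text == "c", width=2)
]
print(rows)
```

Building the full token list costs memory proportional to the number of
tokens in one text; if that is a concern, a sliding window over the
iterator would do the same job.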

For any element, `getparent()` gives you its parent
and therefore (recursively) the path from the root to the element.
Given two elements `e1` and `e2`, you can then determine
their deepest common ancestor. Maybe that helps you solve your problem.
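
A rough sketch of that idea, again using the standard library's
`xml.etree.ElementTree` so that it runs without lxml. ElementTree has no
`getparent()`, so the sketch builds a child-to-parent map first; with
lxml you could call `el.getparent()` instead. The function names and the
sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def ancestor_path(el, parent_of):
    """Return [root, ..., el], the chain of elements from the root down to el."""
    path = [el]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    path.reverse()
    return path


def deepest_common_ancestor(e1, e2, parent_of):
    """Return the lowest element that contains (or is) both e1 and e2."""
    common = None
    paths = zip(ancestor_path(e1, parent_of), ancestor_path(e2, parent_of))
    for a, b in paths:
        if a is not b:
            break  # the two paths diverge here
        common = a
    return common


# Hypothetical sample: "a" and "b" share a <p>, "c" sits in another <p>.
root = ET.fromstring(
    '<div><p><w>a</w><hi><w>b</w></hi></p><p><w>c</w></p></div>'
)
# Child-to-parent map; stands in for lxml's getparent().
parent_of = {child: parent for parent in root.iter() for child in parent}

a, b, c = root.iter("w")
print(deepest_common_ancestor(a, b, parent_of).tag)
print(deepest_common_ancestor(a, c, parent_of).tag)
```

If one element is an ancestor of the other, this returns that ancestor
itself.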
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org