I use lxml to work with a large collection of TEI-encoded texts (66,000) that 
are linguistically annotated.  Each token is wrapped in a <w> or <pc> element 
with a unique ID and various attributes. I can march through the texts at the 
lowest level of <w> and <pc> elements without paying any attention to the 
discursive structure of higher elements. I just do

            for w in tree.iter(tei + 'w', tei + 'pc'):
                if x:
                    ...  # do this
                if y:
                    ...  # do that

But now I want to create a concordance in which tokens meeting some condition 
are pulled out and surrounded with seven words on either side.  I do this with 
itersiblings(), but that is a tricky operation: the next <w> token may not be a 
sibling but a child of a higher-level sibling. Remembering that “elements are 
lists”, you have patterns like

            [a, b, c, [d, e, f], g, h, i, [k, l, m, n]]

Getting from ‘c’ to ‘d’ is one thing; getting from ‘f’ to ‘g’ is another. In a 
large archive of sometimes quite weird encodings, the details become very hairy 
very fast. Is there some “Gordian knot” solution, or does one just figure out 
this obstacle race one detail at a time? There are “soft” tags that do not 
break the continuity of a sentence (<hi>), hard tags that mark a boundary 
beyond which you don’t want to go anyhow (<p>), and “jump tags” (<note>), where 
your “next sibling” is the first <w> after the <note> element, which may be 
quite long.

I am old enough to have grown up with Winnie the Pooh and feel like a “Bear of 
Very Little Brain” when confronted with these problems. I’ll be grateful for 
any advice, including a confirmation that it’s just the way it is.

Martin Mueller
Professor of English and Classics emeritus
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-le...@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/