Hi Martin,

have you solved your problem yet?
> I use lxml to work with a large collection of TEI-encoded texts (66,000) that
> are linguistically annotated. Each token is wrapped in a <w> or <pc> element
> with a unique ID and various attributes.
> I can march through the texts at the lowest level of <w> and <pc> elements
> without paying any attention to the discursive structure of higher elements.
> I just do
>
>     for w in tree.iter(tei + 'w', tei + 'pc'):
>         if x:
>             do this
>         if y:
>             do that
>
> But now I want to create a concordance in which tokens meeting some condition
> are pulled out and surrounded with seven words on either side. I do this
> with itersiblings(), but that is a tricky operation.
> The next <w> token may not be a sibling but a child of a higher level
> sibling. Remembering that “elements are lists” you have patterns like
>
> [a, b, c, [d, e, f] g, h, i, [k, l, m, n]

Is that a case where the matching <w> element sits in a relative clause nested inside a main clause, and you'd like the surrounding words to extend past the end of the relative clause, i.e. to include words from the higher level as well? Sorry if I'm not using the proper terminology here: I'm not a native English speaker, my grammar knowledge is rusty, and I'm not familiar with the TEI structure.

Could you maybe give a very small and simple ("minimal") example of such a TEI-encoded text that illustrates the problem? It should be a working TEI XML file, i.e. parsable with lxml.

> Getting from ‘c’ to ‘d’ is one thing, getting from ‘f’ to ‘g’ is another. In
> a large archive of sometimes quite weird encodings, the details become very
> hairy very fast. Is there are some “Gordian knot” solution,
> or does one just figure out this obstacle race one detail at a time? There
> are “soft” tags that do not break the continuity of a sentence (hi), hard
> tags that mark an end beyond which you don’t want to go
> anyhow (p), and “jump tags” (note) where your “next sibling” is the first <w>
> after the <note> element, which may be quite long.

Are there higher-level structures whose boundaries your search for the surrounding 7 words should never cross? If so, you could iterate over those structures instead, "flatten" each one's <w> and <pc> (or other relevant) descendants into a list, ignoring <note>s, and then work on that flattened list.

Best regards,
Holger
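
P.S. Here is a rough sketch of what I mean by flattening. It assumes that <p> is the only hard boundary, that <note> is the only tag to jump over, and that <p> elements are not nested inside each other (otherwise tokens would be reported twice). The names tokens() and concordance() and the tiny sample are made up for illustration; the sample is not a complete TEI document, and real data will certainly need more cases.

```python
try:
    from lxml import etree
except ImportError:
    # The stdlib ElementTree API is close enough for this sketch.
    import xml.etree.ElementTree as etree

TEI = "{http://www.tei-c.org/ns/1.0}"
TOKENS = {TEI + "w", TEI + "pc"}   # the elements we want to collect
SKIP = {TEI + "note"}              # "jump" tags: whole subtree is ignored
BLOCK = TEI + "p"                  # "hard" tag: context never crosses it


def tokens(element):
    """Yield <w>/<pc> descendants in document order, skipping SKIP subtrees."""
    for child in element:
        if not isinstance(child.tag, str):
            continue               # comments and processing instructions
        if child.tag in SKIP:
            continue
        if child.tag in TOKENS:
            yield child
        else:
            # "soft" tags such as <hi>: descend, keep the sequence going
            yield from tokens(child)


def concordance(root, match, width=7):
    """Yield (left, hit, right) for each token in a block where match(token) is true."""
    for block in root.iter(BLOCK):
        flat = list(tokens(block))
        for i, tok in enumerate(flat):
            if match(tok):
                left = flat[max(0, i - width):i]
                right = flat[i + 1:i + 1 + width]
                yield left, tok, right


SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0"><p>
<w xml:id="a">The</w> <hi><w xml:id="b">old</w> <w xml:id="c">king</w></hi>
<note><w xml:id="n1">see</w> <w xml:id="n2">below</w></note>
<w xml:id="d">slept</w><pc xml:id="e">.</pc></p>
<p><w xml:id="f">Next</w></p></text>"""

root = etree.fromstring(SAMPLE)
hits = [
    ([t.text for t in left], hit.text, [t.text for t in right])
    for left, hit, right in concordance(root, lambda t: t.text == "king", width=2)
]
print(hits)  # [(['The', 'old'], 'king', ['slept', '.'])]
```

The point is that once each block is a plain Python list, the "seven words on either side" are just two slices, and the question of how to get from 'f' to 'g' only has to be answered once, inside tokens().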