On Thu, 2023-06-08 at 04:02 +0000, Martin Mueller wrote:
> Each token is wrapped in a <w> or <pc> element with a unique ID and
> various attributes. I can march through the texts at the lowest level
> of <w> and <pc> elements without paying any attention to the
> discursive structure of higher elements.

[snip]

> There are “soft” tags that do not break the continuity of a sentence
> (hi), hard tags that mark an end beyond which you don’t want to go
> anyhow (p), and “jump tags” (note) where your “next sibling” is the
> first <w> after the <note> element, which may be quite long.
I would approach this by first transforming each document into a simpler
structure, using XSLT. If you do not care about anything other than tei:p,
tei:w, and tei:pc elements, and want all of the latter two to be children of
the former, then your transform can find every tei:p (and any other containing
elements you might have), output it, and then output all of its descendant
tei:w and tei:pc elements as its children. Something like:

<xsl:template match="/">
  <doc>
    <xsl:apply-templates select="//tei:p"/>
  </doc>
</xsl:template>

<xsl:template match="tei:p">
  <p>
    <xsl:apply-templates select=".//tei:w | .//tei:pc"/>
  </p>
</xsl:template>

<xsl:template match="tei:pc | tei:w">
  <xsl:copy>
    <!-- Whatever handling of attributes, children, and content you want. -->
  </xsl:copy>
</xsl:template>

Following that, you can very easily find the preceding and following siblings
without crossing any boundaries; a sketch of how that navigation might look
with lxml follows below.

Jamie
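As a minimal sketch of that last step: assuming the stylesheet above is
completed (with xmlns:tei="http://www.tei-c.org/ns/1.0" declared on
xsl:stylesheet) and saved as flatten.xsl, and one TEI document is play.xml
(both filenames are only placeholders), applying the transform with lxml and
then walking token siblings could look like this:

from lxml import etree

# Placeholder filenames: flatten.xsl is the stylesheet sketched above,
# play.xml is one of the TEI documents.
transform = etree.XSLT(etree.parse("flatten.xsl"))
flat = transform(etree.parse("play.xml"))

TEI = "{http://www.tei-c.org/ns/1.0}"  # xsl:copy keeps <w>/<pc> in the TEI namespace

# In the flattened tree every token of a paragraph is a direct child of its
# <p>, so getprevious()/getnext() step from token to token without ever
# crossing a paragraph boundary or descending into <hi>, <note>, etc.
for p in flat.getroot().iter("p"):        # literal <p> output, assumed to be in no namespace
    for token in p:                       # only <w> and <pc> survive the transform
        prev_tok = token.getprevious()    # None at the first token of the <p>
        next_tok = token.getnext()        # None at the last token of the <p>
        is_word = token.tag == TEI + "w"  # False for punctuation (<pc>)
        print(is_word, prev_tok is not None, next_tok is not None)

The transform does the hard part; once the tokens are plain siblings, any of
lxml's ordinary navigation (getprevious(), getnext(), itersiblings()) respects
the paragraph boundaries for free.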