On Thu, 2023-06-08 at 04:02 +0000, Martin Mueller wrote:

> Each token is wrapped in a <w> or <pc> element with a unique ID and
> various attributes. I can march through the texts at the lowest level
> of <w> and <pc> elements without paying any attention to the
> discursive structure of higher elements.
[snip]
> There are “soft” tags that do not break the continuity of a sentence
> (hi), hard tags that mark an end beyond which you don’t want to go
> anyhow (p), and “jump tags” (note) where your “next sibling” is the
> first <w> after the <note> element, which may be quite long.

I would approach this by first transforming each document into a
simpler structure, using XSLT. If you do not care about anything other
than tei:p, tei:w, and tei:pc elements, and you want the latter two to
end up as children of the former, then your transform can find each
tei:p (and any other containing elements you might have), output it,
and then output all of its descendant tei:w and tei:pc elements as its
children.

Something like:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Emit one flat <p> per tei:p, containing only its tokens. -->
  <xsl:template match="/">
    <doc>
      <xsl:apply-templates select="//tei:p"/>
    </doc>
  </xsl:template>

  <xsl:template match="tei:p">
    <p>
      <xsl:apply-templates select=".//tei:w | .//tei:pc"/>
    </p>
  </xsl:template>

  <xsl:template match="tei:pc | tei:w">
    <xsl:copy>
      <!-- Whatever handling of attributes, children, and content
           you want. -->
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Following that, finding the preceding and following siblings without
crossing boundaries is easy, since every tei:w and tei:pc inside a
given <p> is now a direct sibling of its neighbours.
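
For example, a minimal sketch with lxml (the file names simplify.xsl
and text.xml below are placeholders for your stylesheet and TEI
source):

from lxml import etree

# Placeholder file names -- substitute your own.
transform = etree.XSLT(etree.parse("simplify.xsl"))
simplified = transform(etree.parse("text.xml"))

# xsl:copy keeps the TEI namespace on the copied <w>/<pc> elements.
TEI = "{http://www.tei-c.org/ns/1.0}"

followed_by_pc = 0
for w in simplified.getroot().iter(TEI + "w"):
    # In the flattened tree the next token, if any, is simply the next
    # sibling -- nothing to skip over.  (getprevious() works the same
    # way for the preceding token.)
    nxt = w.getnext()          # None at the end of its <p>
    if nxt is not None and nxt.tag == TEI + "pc":
        followed_by_pc += 1

print(followed_by_pc, "tokens are immediately followed by punctuation")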

Jamie