[lxml] Re: a simple question about a tricky problem

Holger.Joukl Fri, 16 Jun 2023 02:26:00 -0700

Hi Martin,

Did you solve your problem in the meantime?


> I use lxml to work with a large collection of TEI-encoded texts(66,000) that 
> are linguistically annotated.  Each token is wrapped in a <w> or <pc> element 
> with a unique ID and various attributes.
> I can march through the texts at the lowest level of <w> and <pc> elements 
> without paying any attention to the discursive structure of higher elements. 
> I just do
>
>            for w in tree.iter(tei + 'w', tei + 'pc'):
>                if x:
>                    pass  # do this
>                if y:
>                    pass  # do that
>
> But now I want to create a concordance in which tokens meeting some condition 
> are pulled out and surrounded with seven words on either side.  I do this 
> with itersiblings(), but that is a tricky operation.
> The next <w> token may not be a sibling but a child of a higher level 
> sibling.  Remembering that “elements are lists” you have patterns like
>
>            [a, b, c, [d, e, f], g, h, i, [k, l, m, n]]
>

Would that be something like this: the <w> element you found sits in a relative 
clause that is nested in a sentence/main clause, and you'd like to collect 
surrounding words beyond the end of the relative clause, i.e. also from the 
higher level?

Sorry if I'm not using proper terminology here - not a native English speaker, 
my grammar knowledge is rusty and I'm not familiar with
the TEI structure.

Could you maybe give a very small + simple ("minimal") example of such a 
TEI-encoded text that illustrates the problem?
It should be a working TEI XML file, i.e. parsable with lxml.

> Getting from ‘c’ to ‘d’ is one thing, getting from ‘f’ to ‘g’ is another. In 
> a large archive of sometimes quite weird encodings, the details become very 
> hairy very fast. Is there are some “Gordian knot” solution,
> or does one just figure out this obstacle race one detail at a time? There 
> are “soft” tags that do not  break the continuity of a sentence (hi), hard 
> tags that mark an end beyond which you don’t want to go
> anyhow (p), and “jump tags” (note) where your “next sibling” is the first <w> 
> after the <note> element, which may be quite long.

Are there higher-level structures whose boundaries your search for the 
surrounding 7 words should never cross?
If so, you could iterate over these higher-level structures instead, "flatten" 
the <w> and <pc> (or any other relevant) descendants of each one into a list, 
ignoring <note>s, and then do the further processing on that flattened list.
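
A rough sketch of what I mean, not tested against your corpus. It assumes that 
<p> is the hard boundary, that <note> subtrees are skipped entirely, and that 
every other element (e.g. <hi>) is "soft" and simply descended into. The sample 
document, the names flatten_tokens() and concordance(), and the matches 
callback are made up for illustration; <pc> tokens count towards the window 
here, which may not be what you want.

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical minimal sample, not taken from the real corpus.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><w>a</w> <w>b</w> <hi><w>c</w> <w>d</w></hi><note><w>skipped</w></note>
<w>e</w><pc>.</pc></p>
<p><w>f</w> <w>g</w></p>
</body></text></TEI>"""


def flatten_tokens(container, skip=(TEI + "note",)):
    """Return the <w>/<pc> descendants of container in document order,
    leaving out everything inside elements listed in skip."""
    tokens = []

    def walk(element):
        for child in element:
            if not isinstance(child.tag, str):
                continue  # comments and processing instructions
            if child.tag in skip:
                continue  # "jump" tag: ignore the whole subtree
            if child.tag in (TEI + "w", TEI + "pc"):
                tokens.append(child)
            else:
                walk(child)  # "soft" tag: look inside

    walk(container)
    return tokens


def concordance(root, matches, width=7):
    """Yield (left, hit, right) token lists, never crossing a <p> boundary."""
    for p in root.iter(TEI + "p"):
        tokens = flatten_tokens(p)
        for i, token in enumerate(tokens):
            if matches(token):
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                yield left, token, right


root = etree.fromstring(SAMPLE)
for left, hit, right in concordance(root, lambda t: t.text == "d", width=2):
    print([t.text for t in left], hit.text, [t.text for t in right])
```

Once the tokens are in a flat list, the context window is plain list slicing, 
so the itersiblings() bookkeeping goes away. Which tags count as hard, soft or 
"jump" tags would then be a matter of adjusting the tag sets.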

Best regards,
Holger





